R: Insert node into xml tree at specific location

Posted 2019-05-05 11:02

Data

I have an xml file with a structure like this (large example to show the needed flexibility):

<rootnode sth="something" descr="ex">
  <tag sth="sth1" descr="ex" anoAttr="sth2">
    <tag sth="sth3" descr="ex2" searchA="sth4" anoAttr="sth5">
      <tag sth="sth6" descr="ex3" oAttr="sth7" searchA="sth8" anoAttr="sth9">
        <tag sth="sth10" descr="ex4" oAttr="sth11" searchA="sth12" anoAttr="sth13">
          <someContent/>
        </tag>
        <someContent/>
      </tag>
      <tag sth="sth14" descr="ex5" oAttr="sth15" searchA="sth16" anoAttr="sth17">
        <someContent/>
      </tag>
      <tag sth="sth1" descr="ex6" oAttr="sth15" searchA="sth18" anoAttr="sth17">
        <someContent/>
      </tag>
    </tag>
    <tag sth="sth10" descr="ex2" oAttr="sth19" searchA="sth20" anoAttr="sth9">
      <someContent/>
    </tag>
    <tag sth="sth10" descr="ex7" searchA="sth21" anoAttr="sth13">
      <tag sth="sth21" descr="ex8" oAttr="sth22" searchA="sth23" anoAttr="sth9">
        <tag sth="sth23" descr="ex9" oAttr="sth22" searchA="sth24" anoAttr="sth5">
          <someContent/>
        </tag>
        <someContent/>
      </tag>
    </tag>
  </tag>
  <otherNode>
    <someNode/>
  </otherNode>
</rootnode>

Specifically, the size of any of the tag nodes is unknown, the number of attributes is not equal for all tag nodes and the values of the attributes are not unique.
What I do know, however, is that the value of the searchA attribute is unique. Also, only tag nodes can contain an attribute called searchA and all of them except the top level one do.

Before

I first parse this document using the XML package with the function xmlTreeParse() and store the root node. I then create a new node using newXMLNode().

xmlfile = xmlTreeParse(filename, useInternalNodes = TRUE)
xmltop = xmlRoot(xmlfile)
newNode = newXMLNode(name = "newlyCreatedNode")

Goal

My goal is to insert my newly created newNode as a child of the node that has a certain value (for example "sth23") as the searchA attribute.
So in this case I want the result to look like this (notice the <newlyCreatedNode/> near the bottom):

<rootnode sth="something" descr="ex">
  <tag sth="sth1" descr="ex" anoAttr="sth2">
    <tag sth="sth3" descr="ex2" searchA="sth4" anoAttr="sth5">
      <tag sth="sth6" descr="ex3" oAttr="sth7" searchA="sth8" anoAttr="sth9">
        <tag sth="sth10" descr="ex4" oAttr="sth11" searchA="sth12" anoAttr="sth13">
          <someContent/>
        </tag>
        <someContent/>
      </tag>
      <tag sth="sth14" descr="ex5" oAttr="sth15" searchA="sth16" anoAttr="sth17">
        <someContent/>
      </tag>
      <tag sth="sth1" descr="ex6" oAttr="sth15" searchA="sth18" anoAttr="sth17">
        <someContent/>
      </tag>
    </tag>
    <tag sth="sth10" descr="ex2" oAttr="sth19" searchA="sth20" anoAttr="sth9">
      <someContent/>
    </tag>
    <tag sth="sth10" descr="ex7" searchA="sth21" anoAttr="sth13">
      <tag sth="sth21" descr="ex8" oAttr="sth22" searchA="sth23" anoAttr="sth9">
        <tag sth="sth23" descr="ex9" oAttr="sth22" searchA="sth24" anoAttr="sth5">
          <someContent/>
        </tag>
        <someContent/>
        <newlyCreatedNode/>
      </tag>
    </tag>
  </tag>
  <otherNode>
    <someNode/>
  </otherNode>
</rootnode>

Basically, in this case addChildren(xmltop[[1]][[3]][[1]], kids = list(newNode)) gets me the result that I want. Of course I do not want to specify [[1]][[3]][[1]].

What I tried

I can get a list of all relevant nodes with xmlElementsByTagName() and get all attributes with xmlAttrs(). I can even get a logical index vector which gives me the correct location.

# Collect every <tag> element below the root, at any depth
listOfNodes = xmlElementsByTagName(el = xmltop, "tag", recursive = TRUE)
attributeList = lapply(listOfNodes, FUN = function(x) xmlAttrs(x))
# TRUE where searchA matches; NA (no searchA attribute) is set to FALSE
indexVector = sapply(attributeList, FUN = function(x) x["searchA"] == "sth23")
indexVector[is.na(indexVector)] = FALSE
listOfNodes[indexVector]

What I do not know is how to use this information to insert my node into the tree at the correct location.
listOfNodes[indexVector] gives me the correct node, but it is now a list and not a node I can use addChildren() on.
Even if I somehow managed to map the indexVector and the xmlSize() of all nodes to the correct indices that I could use on xmltop directly, I would still have the problem of a variable number of double brackets (xmltop[[1]][[3]] vs xmltop[[1]][[2]][[1]]).

I have also tried several other functions of the XML package, including xmlApply, getNodeLocation and getNodeSet, but they did not seem to help.

What I have not really tried

I do not really understand the difference between xmlTreeParse(), xmlInternalTreeParse() and xmlTreeParse(useInternalNodes = TRUE), and I cannot wrap my head around XPath, so I did not get very far trying to use it.

Any helpful pointers would be much appreciated.

1 answer
一纸荒年 Trace。
#2 · 2019-05-05 11:42

The reason for my confusion was the help page for ?xmlElementsByTagName. It says there:

"The addition of the recursive argument makes this function behave like the getElementsByTagName in other language APIs such as Java, C#. However, one should be careful to understand that in those languages, one would get back a set of node objects. These nodes have references to their parents and children. Therefore one can navigate the tree from each node, find its relations, etc. In the current version of this package (and for the forseeable future), the node set is a “copy” of the nodes in the original tree. And these have no facilities for finding their siblings or parent."

This made me think that the function returns a list of copies instead of references to the nodes themselves.
That may well be the case when the xml is parsed with the useInternalNodes flag of xmlTreeParse() set to FALSE, but when it is set to TRUE, the list returned by xmlElementsByTagName() seems to contain references to the actual nodes.
These can easily be manipulated using for example addChildren().
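
The difference between references and copies can be checked on a small example. The following is only a sketch under assumptions: the miniature document and the node name "added" are made up for illustration and are not taken from the question.

```r
library(XML)

# Hypothetical miniature document, not the one from the question
txt <- '<root><tag searchA="a"><x/></tag><tag searchA="b"/></root>'

# Internal nodes: the returned list holds references into the tree
doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
top <- xmlRoot(doc)
nodes <- xmlElementsByTagName(top, "tag")
addChildren(nodes[[2]], kids = list(newXMLNode("added")))
xmlSize(top[[2]])    # counts the new child, so the tree itself was changed

# R-level nodes: the returned list holds copies of the nodes
rdoc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = FALSE)
rtop <- xmlRoot(rdoc)
rnodes <- xmlElementsByTagName(rtop, "tag")
rnodes[[2]] <- addChildren(rnodes[[2]], kids = list(xmlNode("added")))
xmlSize(rnodes[[2]]) # the copy has the new child
xmlSize(rtop[[2]])   # the original tree is left untouched
```

With R-level nodes, the modified copy would have to be written back into the tree by hand, which is exactly the indexing problem described in the question.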

In short, the very simple solution to my problem is:

addChildren(listOfNodes[indexVector][[1]], kids = list(newNode))
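
Since the question mentions XPath, here is a possible alternative that skips the index vector. It is a minimal sketch, assuming the document is parsed into internal nodes and that the searchA value contains no single quote. The shortened document and the variable names target and hits are made up for illustration.

```r
library(XML)

# Shortened, hypothetical version of the document from the question
txt <- paste0(
  '<rootnode><tag sth="sth1">',
  '<tag sth="sth10" searchA="sth21">',
  '<tag sth="sth21" searchA="sth23">',
  '<tag sth="sth23" searchA="sth24"><someContent/></tag>',
  '<someContent/>',
  '</tag></tag></tag></rootnode>'
)
doc <- xmlParse(txt, asText = TRUE)  # xmlParse() uses internal nodes by default

# Select every <tag>, at any depth, whose searchA attribute equals the target
target <- "sth23"
hits <- getNodeSet(doc, sprintf("//tag[@searchA='%s']", target))

# searchA is stated to be unique, so exactly one node is expected
stopifnot(length(hits) == 1)

# hits[[1]] is a reference into the tree, so this modifies doc in place
addChildren(hits[[1]], kids = list(newXMLNode("newlyCreatedNode")))
names(xmlChildren(hits[[1]]))
```

The XPath expression does the searching that the lapply()/sapply() steps did before, and getNodeSet() returns a list, so the [[1]] is still needed to get at the node itself.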