Efficiently get the number of children with specif

Posted 2019-02-20 13:23

Question:

Using R and the XML package, I'm parsing huge XML files. As part of the data handling I need to know, for each node in a long list of nodes, how many children with a specific name it has (the number of nodes can exceed 20,000).

My approach at the moment is:

nChildrenWithName <- xpathSApply(doc, path="/path/to/node/*", namespaces=ns, xmlName) == 'NAME'
nChildren <- xpathSApply(doc, path="/path/to/node", namespaces=ns, fun=xmlSize)
nID <- sapply(split(nChildrenWithName, rep(seq(along=nChildren), nChildren)), sum)

This is as vectorized as I can get it. Still, I have the feeling that this could be achieved in a single call with the right XPath expression. My knowledge of XPath is limited, though, so if anyone knows how to do it I would be grateful for some insight.

best Thomas

Answer 1:

If I understand the question correctly, the XML looks like this:

<path>
  <to>
    <node>
      <NAME>A</NAME>
      <NAME>B</NAME>
      <NAME>C</NAME>
    </node>
    <node>
      <NAME>X</NAME>
      <NAME>Y</NAME>
    </node>
  </to>
  <to>
    <node>
      <NAME>AA</NAME>
      <NAME>BB</NAME>
      <NAME>CC</NAME>
    </node>
  </to>
</path>

and what is wanted is the number of NAME elements under each node element: 3, 2, and 3 in the example above.

This is not possible in XPath 1.0: an expression can return a node set or a single value, but not a list of computed values.

Using XPath 2.0 you can write:

for $node in /path/to/node return count($node/NAME)

or simply:

/path/to/node/count(NAME)

(Both expressions can be tried in any online XPath 2.0 tester.)
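Since the XML package evaluates XPath 1.0, one way to get a count per node from R is to select the node elements with XPath and do the counting in an R function applied to each match. The following is only a minimal sketch under some assumptions: it uses the sample document above without namespaces (the asker's real files need the namespaces=ns argument), and count_children_named is a hypothetical helper name, not part of the XML package.

```r
library(XML)

# Sample document from this answer (no namespaces, unlike the asker's real files)
xml_text <- "<path>
  <to>
    <node><NAME>A</NAME><NAME>B</NAME><NAME>C</NAME></node>
    <node><NAME>X</NAME><NAME>Y</NAME></node>
  </to>
  <to>
    <node><NAME>AA</NAME><NAME>BB</NAME><NAME>CC</NAME></node>
  </to>
</path>"
doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: number of direct children of `node` whose tag is `name`
count_children_named <- function(node, name) {
  child_names <- vapply(xmlChildren(node), xmlName, character(1))
  sum(child_names == name)
}

# One XPath query selects the parents; the counting happens in R, once per match
counts <- xpathSApply(doc, "/path/to/node",
                      function(n) count_children_named(n, "NAME"))
print(counts)  # 3 2 3 for the sample document
```

This still makes a single xpathSApply call, but the per-node work is done by an R function rather than by XPath itself, so it is worth timing against the split/rep approach from the question on a real file.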



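Answer 2:

library(XML)
# Parse the example file shipped with the XML package into internal nodes
doc <- xmlTreeParse(
  system.file("exampleData", "mtcars.xml", package="XML"),
  useInternalNodes=TRUE)
# count() returns a single number for the whole document, not one per parent node
xpathApply(xmlRoot(doc), path="count(//variable)", xmlValue)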
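Note that count() in XPath 1.0 collapses everything into one number, so to get per-node values with it the expression has to be evaluated once per node. Here is a rough sketch of that idea, assuming the sample document from Answer 1 without namespaces; the positional predicate (/path/to/node)[i] is my own choice for addressing the i-th node and is not taken from the answers here.

```r
library(XML)

# Sample document from Answer 1 (no namespaces assumed)
xml_text <- "<path>
  <to>
    <node><NAME>A</NAME><NAME>B</NAME><NAME>C</NAME></node>
    <node><NAME>X</NAME><NAME>Y</NAME></node>
  </to>
  <to>
    <node><NAME>AA</NAME><NAME>BB</NAME><NAME>CC</NAME></node>
  </to>
</path>"
doc <- xmlParse(xml_text, asText = TRUE)

# A single count() gives the total over all nodes
total <- xpathSApply(doc, "count(/path/to/node/NAME)")

# Number of parent nodes, then one count() query per parent (by document order)
n_nodes <- xpathSApply(doc, "count(/path/to/node)")
per_node <- vapply(seq_len(n_nodes), function(i) {
  xpathSApply(doc, sprintf("count((/path/to/node)[%d]/NAME)", i))
}, numeric(1))

print(total)
print(per_node)
```

Each query re-evaluates the path from the document root, so with around 20,000 nodes this is likely to be much slower than selecting the nodes once and counting in R; it is shown mainly to illustrate why a single XPath 1.0 expression cannot return the whole vector.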


Answer 3:

Consider the example given by MiMo:

<path>
  <to>
    <node>
      <NAME>A</NAME>
      <NAME>B</NAME>
      <NAME>C</NAME>
    </node>
    <node>
      <NAME>X</NAME>
      <NAME>Y</NAME>
    </node>
  </to>
  <to>
    <node>
      <NAME>AA</NAME>
      <NAME>BB</NAME>
      <NAME>CC</NAME>
    </node>
  </to>
</path>

To get the number of NAME children under the first /path/to/node:

library(XML)
doc = xmlParse("filename", useInternalNodes = TRUE)
rootNode = xmlRoot(doc)
childnodes = xpathSApply(rootNode[[1]][[1]], ".//NAME", xmlChildren)
length(childnodes)
[1] 3

This gives 3. To get the number of children under the second node, change the index accordingly:

childnodes = xpathSApply(rootNode[[1]][[2]], ".//NAME", xmlChildren)
length(childnodes)
[1] 2

I hope it will help you.