Parsing XML in R: Incorrect namespaces

Posted 2019-04-28 12:12

Question:

I have a set of XML files and an R script that reads their content into a data frame. I have now received some files that I want to parse in the same way, but something in their namespace definition prevents me from selecting their values with my usual XPath expressions.

The XML files look like this:

xml_nons.xml

<?xml version="1.0" encoding="UTF-8"?>
<XML>
   <Node>
      <Name>Name 1</Name>
      <Title>Title 1</Title>
      <Date>2015</Date>
   </Node>
</XML>

And the other:

xml_ns.xml

<?xml version="1.0" encoding="UTF-8"?>
<XML xmlns="http://www.nonexistingsite.com">
   <Node>
      <Name>Name 2</Name>
      <Title>Title 2</Title>
      <Date>2014</Date>
   </Node>
</XML>

The URL that xmlns points to doesn't exist.

The R code I use is like this:

library(XML)

xmlfiles <- list.files(path = ".", 
                       pattern = "\\.xml$",  # regular expression, not a glob
                       full.names = TRUE, 
                       recursive = TRUE)

n <- length(xmlfiles)
dat <- vector("list", n)

for (i in seq_len(n)) {
  # Parse into an internal document so that XPath queries can be run on it
  doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)
  nodes <- getNodeSet(doc, "//XML")
  # Build one data frame per <XML> element found in the file
  x <- lapply(nodes, function(x) {
    data.frame(
      Filename = xmlfiles[i],
      Name = xpathSApply(x, ".//Node/Name", xmlValue),
      Title = xpathSApply(x, ".//Node/Title", xmlValue),
      Date = xpathSApply(x, ".//Node/Date", xmlValue)
    )
  })
  dat[[i]] <- do.call("rbind", x)
}

xml <- do.call("rbind", dat)
xml

However, what I get as a result is:

Filename            Name    Title    Date
./xml_nons.xml      Name 1  Title 1  2015

If I remove the namespace declaration from the second file, I get the correct result:

Filename            Name    Title    Date
./xml_nons_1.xml    Name 1  Title 1  2015
./xml_ns_1.xml      Name 2  Title 2  2014

Of course I could use an XSL transformation to remove the namespaces from the original XML files, but I would like a solution that works within R. Is there some way to tell R to just ignore the namespace declaration on the root element?

Answer 1:

I don't think there is an easy way to ignore the namespaces; the best approach is to learn to live with them. This answer uses the newer xml2 package, but the same applies to a solution based on the XML package.
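
For readers who want to stay with the XML package from the question, the same idea can be sketched roughly as follows. This is an illustration rather than part of the original answer: the prefix d is an arbitrary name chosen here, and the namespaced document is inlined as a string (without the XML declaration) so that the snippet runs on its own.

```r
library(XML)

# The namespaced document from the question, inlined for a self-contained sketch
txt <- '<XML xmlns="http://www.nonexistingsite.com">
   <Node>
      <Name>Name 2</Name>
      <Title>Title 2</Title>
      <Date>2014</Date>
   </Node>
</XML>'

doc <- xmlParse(txt, asText = TRUE)

# "d" is an arbitrary prefix (an assumption of this sketch);
# only the URI has to match the one declared in the document
ns <- c(d = "http://www.nonexistingsite.com")

# Without a prefix the query matches nothing ...
no_prefix <- xpathSApply(doc, "//Node/Name", xmlValue)

# ... with the prefix on every step it finds the element
name_ns <- xpathSApply(doc, "//d:Node/d:Name", xmlValue, namespaces = ns)
```

The prefix is local to the query, so it does not need to appear anywhere in the file itself.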

With xml2, first look at the namespaces the document declares:

library(xml2)  # the package name is lower case
fname <- "myfile.xml"
doc <- read_xml(fname)
# peek at the namespaces
xml_ns(doc)

The first default namespace is assigned the prefix d1. If your XPath does not find what you want, the most likely cause is a namespace issue.

# FormDef is not an element of the question's files; substitute your own element name
xpath <- "//d1:FormDef"
ns <- xml_find_all(doc, xpath, xml_ns(doc))
ns
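
Applied to the xml_ns.xml file from the question, the d1 prefix might be used as in the sketch below. The document is inlined as a string here only to keep the example self-contained; with a real file, read_xml would be given the file name instead.

```r
library(xml2)

# The namespaced document from the question, inlined for a self-contained sketch
doc <- read_xml('<XML xmlns="http://www.nonexistingsite.com">
   <Node>
      <Name>Name 2</Name>
      <Title>Title 2</Title>
      <Date>2014</Date>
   </Node>
</XML>')

# The default namespace is exposed under the prefix d1
ns <- xml_ns(doc)

# Unprefixed steps match nothing, because the elements live in a namespace
plain <- xml_find_all(doc, "//Node/Name")

# Prefixing every step with d1: selects the element
names_found <- xml_text(xml_find_all(doc, "//d1:Node/d1:Name", ns))
```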

Also, you have to do this for every element in the path. To save typing, you can do:

library(stringr)
xpath <- "/ODM/Study"
(xpath <- str_replace_all(xpath, "/", "/d1:"))
# [1] "/d1:ODM/d1:Study"
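
Putting the pieces together for the two files in the question, one possible sketch is shown below. The helpers add_prefix and find_text are hypothetical names introduced for this illustration, not functions from xml2. add_prefix uses a regular expression instead of a plain replacement so that paths starting with // also work, and it is only intended for simple element paths (no attributes, axes, or function calls). The documents are inlined as strings so that the snippet runs on its own.

```r
library(xml2)

# Hypothetical helper: put the prefix in front of every element name that
# follows one or more slashes, e.g. "//Node/Name" -> "//d1:Node/d1:Name"
add_prefix <- function(xpath, prefix = "d1") {
  gsub("(/+)([A-Za-z_])", paste0("\\1", prefix, ":\\2"), xpath)
}

# Hypothetical helper: use the d1 prefix only if the document declares
# a default namespace, so the same path works for both kinds of file
find_text <- function(doc, xpath) {
  ns <- xml_ns(doc)
  if ("d1" %in% names(ns)) xpath <- add_prefix(xpath)
  xml_text(xml_find_all(doc, xpath, ns))
}

# Stand-ins for xml_nons.xml and xml_ns.xml from the question
docs <- list(
  xml_nons = read_xml("<XML><Node><Name>Name 1</Name><Title>Title 1</Title><Date>2015</Date></Node></XML>"),
  xml_ns   = read_xml('<XML xmlns="http://www.nonexistingsite.com"><Node><Name>Name 2</Name><Title>Title 2</Title><Date>2014</Date></Node></XML>')
)

rows <- lapply(names(docs), function(id) {
  doc <- docs[[id]]
  data.frame(
    Source = id,
    Name   = find_text(doc, "//Node/Name"),
    Title  = find_text(doc, "//Node/Title"),
    Date   = find_text(doc, "//Node/Date"),
    stringsAsFactors = FALSE
  )
})

result <- do.call(rbind, rows)
result
```

Checking for d1 per document keeps the loop from the question unchanged in shape: files without a namespace are queried with the original paths, and namespaced files get the prefixed version.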