R dataframe from XML when values are multiple or missing

Posted 2019-01-23 21:57

Question:

This question is similar to a previous question, "Import all fields (and subfields) of XML as dataframe", but I want to pull out only a subset of the XML data and keep elements whose values are missing or occur several times.

I start with an XML file and want to construct a dataframe in R from some of the data it contains, selected by the contents of XML elements. It is easiest to explain with an example. In the example below, I want to pick out the landmark information for every city (even if a city has no landmark element or has several) and ignore the information about stations.

<world>
    <city>
        <name>London</name>
        <buildings>
            <building>
                <type>landmark</type>
                <bname>Tower Bridge</bname>
            </building>
            <building>
                <type>station</type>
                <bname>Waterloo</bname>
            </building>
        </buildings>
    </city>
    <city>
        <name>New York</name>
        <buildings>
            <building>
                <type>station</type>
                <bname>Grand Central</bname>
            </building>
        </buildings>
    </city>
    <city>
        <name>Paris</name>
        <buildings>
            <building>
                <type>landmark</type>
                <bname>Eiffel Tower</bname>
            </building>
            <building>
                <type>landmark</type>
                <bname>Louvre</bname>
            </building>
        </buildings>
    </city>
</world>

Ideally this would go into a dataframe that looks something like this:

 London      Tower Bridge
 New York    NA
 Paris       Eiffel Tower
 Paris       Louvre

I assumed there would be a way to do this with the XML package and xpathSApply, but I think I'm beaten.

I also couldn't think how to phrase the question without referring to the example, so feel free to edit it to give a more descriptive title.

Answer 1:

Assuming the XML data is in a file called world.xml, read it in and iterate over the cities, extracting the city name and the bname of any associated landmarks:

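library(XML)
doc <- xmlParse("world.xml", useInternalNodes = TRUE)

# Build one small data frame per city, then stack them with rbind
do.call(rbind, xpathApply(doc, "/world/city", function(node) {

   city <- xmlValue(node[["name"]])

   # bname of every building in this city whose type is 'landmark'
   xp <- "./buildings/building[./type/text()='landmark']/bname"
   landmark <- xpathSApply(node, xp, xmlValue)
   # No match may come back as NULL or as an empty list, so test the length
   if (length(landmark) == 0) landmark <- NA

   data.frame(city, landmark, stringsAsFactors = FALSE)

}))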

The result is:

      city     landmark
1   London Tower Bridge
2 New York         <NA>
3    Paris Eiffel Tower
4    Paris       Louvre
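
As a hedged illustration of why the answer loops over the cities instead of running one XPath over the whole document, the sketch below parses the sample XML from a string (the xml_text variable and the compact layout are assumptions made for this example) and queries landmarks and city names separately.

```r
library(XML)

# The sample document from the question, written compactly as one string
xml_text <- paste0(
  "<world>",
  "<city><name>London</name><buildings>",
  "<building><type>landmark</type><bname>Tower Bridge</bname></building>",
  "<building><type>station</type><bname>Waterloo</bname></building>",
  "</buildings></city>",
  "<city><name>New York</name><buildings>",
  "<building><type>station</type><bname>Grand Central</bname></building>",
  "</buildings></city>",
  "<city><name>Paris</name><buildings>",
  "<building><type>landmark</type><bname>Eiffel Tower</bname></building>",
  "<building><type>landmark</type><bname>Louvre</bname></building>",
  "</buildings></city>",
  "</world>"
)
doc <- xmlParse(xml_text, asText = TRUE)

# A single document-wide query finds every landmark name...
all_landmarks <- xpathSApply(doc, "//building[type='landmark']/bname", xmlValue)

# ...and another finds every city name
cities <- xpathSApply(doc, "/world/city/name", xmlValue)

# Both vectors have three elements here, but they do not line up:
# position 2 is "New York" in cities and "Eiffel Tower" in all_landmarks
print(cities)
print(all_landmarks)
```

Because the document-wide query drops the link between a landmark and its city, and silently skips cities with no landmark, pairing the two vectors by position would give the wrong table. Evaluating the XPath relative to each city node avoids this.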


Answer 2:

You can use xmlToList and then plyr's ldply to get a dataframe to work with (xData is assumed to hold the XML source):

require(XML)
require(plyr)
xD <- xmlParse(xData)
xL <- xmlToList(xD)
ldply(xL, data.frame)

The result is:

   .id     name buildings.building.type buildings.building.bname
1 city   London                landmark             Tower Bridge
2 city New York                 station            Grand Central
3 city    Paris                landmark             Eiffel Tower
  buildings.building.type.1 buildings.building.bname.1
1                   station                   Waterloo
2                      <NA>                       <NA>
3                  landmark                     Louvre

You can then pick what you need from this dataframe.
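
One possible way to do that picking is sketched below in base R. To keep the example self-contained, the wide dataframe is typed in by hand from the output shown above (without the .id column); the names wide, long, landmarks and result are made up for this sketch, and the approach assumes the type and bname columns appear in matching order.

```r
# Hand-typed copy of the wide data frame printed above (assumption: same layout)
wide <- data.frame(
  name                       = c("London", "New York", "Paris"),
  buildings.building.type    = c("landmark", "station", "landmark"),
  buildings.building.bname   = c("Tower Bridge", "Grand Central", "Eiffel Tower"),
  buildings.building.type.1  = c("station", NA, "landmark"),
  buildings.building.bname.1 = c("Waterloo", NA, "Louvre"),
  stringsAsFactors = FALSE
)

# Matching type/bname column names, in the order they appear
type_cols  <- grep("type",  names(wide), value = TRUE)
bname_cols <- grep("bname", names(wide), value = TRUE)

# Stack each type/bname pair into one long table with a row per building slot
long <- do.call(rbind, Map(function(tc, bc) {
  data.frame(city = wide$name, type = wide[[tc]], landmark = wide[[bc]],
             stringsAsFactors = FALSE)
}, type_cols, bname_cols))
rownames(long) <- NULL

# Keep only the landmarks
landmarks <- long[!is.na(long$type) & long$type == "landmark",
                  c("city", "landmark")]

# Left join against the full city list so cities without a landmark get NA
result <- merge(data.frame(city = wide$name, stringsAsFactors = FALSE),
                landmarks, by = "city", all.x = TRUE)
print(result)
```

The final merge with all.x = TRUE is what brings New York back with an NA landmark; without it, cities that have only stations would disappear from the result.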



Answer 3:

There is a solution using xpathSApply, but writing the XPath is a little complicated. So here I propose a solution that uses xmlToDataFrame and some regular expressions to extract the buildings.

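# doc is the parsed XML document (e.g. the result of xmlParse)
dd <- xmlToDataFrame(doc)
# Each city's building text arrives concatenated, roughly "landmarkTower BridgestationWaterloo"
rr <- gsub('landmark', ',', dd$buildings)
# Drop everything from the first "station" on (assumes stations come after landmarks)
rr <- gsub('station.*', '', rr)
# Split on the commas and discard the empty pieces
builds <- lapply(strsplit(rr, ','),
                 function(x) x[nchar(x) > 0])
dd$buildings <- builds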

      name            buildings
1   London         Tower Bridge
2 New York                     
3    Paris Eiffel Tower, Louvre
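
The result above keeps the landmarks in a list column, with an empty entry for New York. A minimal sketch of how that could be expanded into the one-row-per-landmark layout from the question follows; dd is typed in by hand here to mirror the printed result, and filled and long are names invented for the example.

```r
# Hand-built stand-in for dd after the steps above (assumption: same contents)
dd <- data.frame(name = c("London", "New York", "Paris"),
                 stringsAsFactors = FALSE)
dd$buildings <- list("Tower Bridge", character(0), c("Eiffel Tower", "Louvre"))

# Replace empty entries with NA so every city keeps at least one row
filled <- lapply(dd$buildings,
                 function(b) if (length(b) == 0) NA_character_ else b)

# Repeat each city name once per landmark and flatten the list column
long <- data.frame(
  city     = rep(dd$name, lengths(filled)),
  landmark = unlist(filled),
  stringsAsFactors = FALSE
)
print(long)
```

Filling the empty entries before flattening matters: unlist() drops zero-length elements, so without that step New York would vanish and the city and landmark vectors would no longer have the same length.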


Answer 4:

If you're looking to exactly reproduce the desired output you showed in your question, you can convert your XML to a list and then extract the information you want:

xml_list <- xmlToList(xmlParse(xml_data))

First, loop through each "building" node and remove those that contain "station":

xml_list <- lapply(xml_list, lapply, function(x) {
  x[!sapply(x, function(y) any(y == "station"))]
})

Then collect the data for each city into a data frame:

xml_list <- lapply(xml_list, function(x) {
  bldgs <- unlist(x$buildings)
  bldgs <- bldgs[bldgs != "landmark"]
  if(is.null(bldgs)) bldgs <- NA
  data.frame(
    "city" = x$name,
    "landmark" = bldgs,
    stringsAsFactors = FALSE)
})

Then combine information from all cities together:

xml_output <- do.call("rbind", xml_list)
xml_output
           city     landmark
city     London Tower Bridge
city1  New York         <NA>
city.1    Paris Eiffel Tower
city.2    Paris       Louvre
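
The row names in this output (city, city1, city.1, city.2) appear to be carried over from the names of the list elements and are not part of the data. If plain row numbers are preferred, as in the desired output, a small sketch such as the following should do; xml_output is rebuilt by hand here from the printed result so the example runs on its own.

```r
# Hand-typed stand-in for xml_output, copied from the printed result above
xml_output <- data.frame(
  city      = c("London", "New York", "Paris", "Paris"),
  landmark  = c("Tower Bridge", NA, "Eiffel Tower", "Louvre"),
  row.names = c("city", "city1", "city.1", "city.2"),
  stringsAsFactors = FALSE
)

# Dropping the row names restores the default 1, 2, 3, ... numbering
rownames(xml_output) <- NULL
print(xml_output)
```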