Question:
This question is similar to a previous question, Import all fields (and subfields) of XML as dataframe, but I want to pull out only a subset of the XML data and want to include missing/multiple values.
I start with an XML file and want to construct a dataframe in R based on some of the data it contains, defined by the contents of XML elements. It is easiest to explain with an example. In the example below, I want to pick out the information about landmarks for every city (even if there is no landmark element or there are several) and ignore the information about stations.
<world>
  <city>
    <name>London</name>
    <buildings>
      <building>
        <type>landmark</type>
        <bname>Tower Bridge</bname>
      </building>
      <building>
        <type>station</type>
        <bname>Waterloo</bname>
      </building>
    </buildings>
  </city>
  <city>
    <name>New York</name>
    <buildings>
      <building>
        <type>station</type>
        <bname>Grand Central</bname>
      </building>
    </buildings>
  </city>
  <city>
    <name>Paris</name>
    <buildings>
      <building>
        <type>landmark</type>
        <bname>Eiffel Tower</bname>
      </building>
      <building>
        <type>landmark</type>
        <bname>Louvre</bname>
      </building>
    </buildings>
  </city>
</world>
Ideally this would go into a dataframe that looks something like this:
London Tower Bridge
New York NA
Paris Eiffel Tower
Paris Louvre
I assumed there might be a way to do this using the XML library and xpathSApply, but I think I'm beaten.
Also couldn't think how to phrase the question without just referring to the example so feel free to edit to give a more descriptive question.
Answer 1:
Assuming the XML data is in a file called world.xml, read it in and iterate over the cities, extracting the city name and the bname of any associated landmarks:
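library(XML)
# Parse the file into an internal document so XPath queries can be run on it
doc <- xmlParse("world.xml", useInternalNodes = TRUE)
# Build one small data frame per city, then stack them with rbind
do.call(rbind, xpathApply(doc, "/world/city", function(node) {
  city <- xmlValue(node[["name"]])
  # Relative XPath: bname of every building in this city whose type is 'landmark'
  xp <- "./buildings/building[./type/text()='landmark']/bname"
  landmark <- xpathSApply(node, xp, xmlValue)
  # No match gives an empty result; use NA so the city still gets a row
  if (length(landmark) == 0) landmark <- NA_character_
  data.frame(city, landmark, stringsAsFactors = FALSE)
}))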
The result is:
city landmark
1 London Tower Bridge
2 New York <NA>
3 Paris Eiffel Tower
4 Paris Louvre
Answer 2:
You can use xmlToList and then plyr to get a data frame you can work with:
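library(XML)
library(plyr)
# xData is assumed to hold the XML shown in the question (not defined in this answer)
xD <- xmlParse(xData)
# Convert the parsed document to a nested list, one element per city
xL <- xmlToList(xD)
# Flatten each city's list into a one-row data frame and stack the rows
ldply(xL, data.frame)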
.id name buildings.building.type buildings.building.bname
1 city London landmark Tower Bridge
2 city New York station Grand Central
3 city Paris landmark Eiffel Tower
buildings.building.type.1 buildings.building.bname.1
1 station Waterloo
2 <NA> <NA>
3 landmark Louvre
You can pick what you need from this data frame.
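As a rough illustration of that last step, the sketch below reshapes the wide result into the layout asked for in the question. To stay self-contained it rebuilds the wide data frame by hand from the printed output (dropping the .id column) instead of calling ldply; the names wide, long and result are hypothetical, and it assumes the type/bname columns come in matching pairs as printed above.

```r
# Wide result copied by hand from the printed output, so this runs without XML or plyr
wide <- data.frame(
  name = c("London", "New York", "Paris"),
  buildings.building.type = c("landmark", "station", "landmark"),
  buildings.building.bname = c("Tower Bridge", "Grand Central", "Eiffel Tower"),
  buildings.building.type.1 = c("station", NA, "landmark"),
  buildings.building.bname.1 = c("Waterloo", NA, "Louvre"),
  stringsAsFactors = FALSE
)

# Column names for each (type, bname) pair, in matching order
type_cols  <- grep("type",  names(wide), value = TRUE)
bname_cols <- grep("bname", names(wide), value = TRUE)

# Stack every (type, bname) pair into long form: one row per building slot
long <- do.call(rbind, Map(function(tc, bc) {
  data.frame(city = wide$name, type = wide[[tc]], landmark = wide[[bc]],
             stringsAsFactors = FALSE)
}, type_cols, bname_cols))

# Keep landmarks only (empty slots have NA as their type)
result <- long[!is.na(long$type) & long$type == "landmark", c("city", "landmark")]

# Add an NA row for every city that has no landmark at all
missing <- setdiff(wide$name, result$city)
if (length(missing) > 0) {
  result <- rbind(result,
                  data.frame(city = missing, landmark = NA_character_,
                             stringsAsFactors = FALSE))
}

# Restore the original city order and tidy the row names
result <- result[order(match(result$city, wide$name)), ]
rownames(result) <- NULL
result
```

With more than two buildings per city the wide frame gains further .2, .3, ... column pairs, which the same grep-based pairing should pick up, provided each pair keeps the type/bname naming.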
Answer 3:
There is a solution using xpathSApply, but writing the XPath here is a little complicated. So here I propose a solution that uses xmlToDataFrame and some regular expressions to extract the buildings.
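# doc is the parsed document, e.g. doc <- xmlParse("world.xml") as in the first answer
dd <- xmlToDataFrame(doc)
# Each 'buildings' value is the concatenated text of its children,
# e.g. "landmarkTower BridgestationWaterloo"; turn 'landmark' into a separator
rr <- gsub('landmark', ',', dd$buildings)
# Drop everything from the first 'station' onwards (assumes landmarks come first)
rr <- gsub('station.*', '', rr)
# Split on the separator and discard empty strings
builds <- lapply(strsplit(rr, ','), function(x) x[nchar(x) > 0])
dd$buildings <- builds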
name buildings
1 London Tower Bridge
2 New York
3 Paris Eiffel Tower, Louvre
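This result keeps one row per city with a list column, whereas the question asks for one row per landmark and NA for cities without any. A minimal sketch of that expansion follows; it rebuilds dd by hand from the output above so that it runs on its own, and the names b and long are hypothetical.

```r
# Rebuild the result shown above by hand: one row per city, landmarks in a list column
dd <- data.frame(name = c("London", "New York", "Paris"), stringsAsFactors = FALSE)
dd$buildings <- list("Tower Bridge", character(0), c("Eiffel Tower", "Louvre"))

# Replace empty entries with NA so every city keeps at least one row
b <- lapply(dd$buildings, function(x) if (length(x) == 0) NA_character_ else x)

# Repeat each city name once per landmark and flatten the list column
long <- data.frame(city = rep(dd$name, lengths(b)),
                   landmark = unlist(b),
                   stringsAsFactors = FALSE)
long
```

Here lengths() gives the number of entries per city after the NA fill, so rep() lines each city name up with its landmarks.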
Answer 4:
If you're looking to exactly reproduce the desired output you showed in your question, you can convert your XML to a list and then extract the information you want:
xml_list <- xmlToList(xmlParse(xml_data))
First loop through each "building" node and remove those that contain "station":
xml_list <- lapply(xml_list, lapply, function(x) {
  x[!sapply(x, function(y) any(y == "station"))]
})
Then collect the data for each city into a data frame:
xml_list <- lapply(xml_list, function(x) {
  bldgs <- unlist(x$buildings)
  bldgs <- bldgs[bldgs != "landmark"]
  if (is.null(bldgs)) bldgs <- NA
  data.frame(
    "city" = x$name,
    "landmark" = bldgs,
    stringsAsFactors = FALSE)
})
Then combine information from all cities together:
xml_output <- do.call("rbind", xml_list)
xml_output
city landmark
city London Tower Bridge
city1 New York <NA>
city.1 Paris Eiffel Tower
city.2 Paris Louvre