Merge many XML files into one data frame in R

Posted 2020-07-17 15:21

I have many XML files (around 100,000) which all look like the following. Each file has around 100 point nodes. I only show five of them for illustration.

<?xml version="1.0" encoding="UTF-8"?>
-<car id="id1">
<point time="1272686841" lon="-122.40648" lat="37.79778" status="E" unit="id1"/>
<point time="1272686781" lon="-122.40544" lat="37.79714" status="M" unit="id1"/>
<point time="1272686722" lon="-122.40714" lat="37.79774" status="M" unit="id1"/>
<point time="1272686661" lon="-122.40704" lat="37.7976" status="M" unit="id1"/>
<point time="1272686619" lon="-122.40616" lat="37.79698" status="E" unit="id1"/>
</car>

I want to merge all these XML files into one big data frame (about 100,000 × 100 = 10,000,000 rows) in R with five columns (time, lon, lat, unit, status). All files have the same five attributes, but they may appear in a different order.

The following is my code. I first create five vectors to hold the five variables, then go through each file and read the entries one by one.

library(XML)    # xmlTreeParse, xmlRoot, xmlGetAttr
library(tools)  # file_ext

setwd("C:\\Users\\MyName\\Desktop\\XMLTest")
all.files <- list.files()
n <- 2000000
all.lon <- rep(NA, n)
all.lat <- rep(NA, n)
all.time <- rep(NA, n)
all.status <- rep(NA, n)
all.unit <- rep(NA, n)
i <- 1
for (cur.file in all.files) {
  if (tolower(file_ext(cur.file)) == "xml") {
    xmlfile <- xmlTreeParse(cur.file)
    xmltop <- xmlRoot(xmlfile)
    for (j in 1:length(xmltop)) {
      cur.node <- xmltop[[j]]
      cur.lon <- as.numeric(xmlGetAttr(cur.node, "lon"))
      cur.lat <- as.numeric(xmlGetAttr(cur.node, "lat"))
      cur.time <- as.numeric(xmlGetAttr(cur.node, "time"))
      cur.unit <- xmlGetAttr(cur.node, "unit")
      cur.status <- xmlGetAttr(cur.node, "status")
      all.lon[i] <- cur.lon
      all.lat[i] <- cur.lat
      all.time[i] <- cur.time
      all.status[i] <- cur.status
      all.unit[i] <- cur.unit
      i <- i + 1
    }
  }
}

I am new to XML, so this is the best I can do for now. The problem is that it is very slow. One reason is the sheer number of files; another is the inner loop for (j in 1:length(xmltop)) that reads the entries one at a time. I tried xmlToDataFrame, but it does not work:

> xmlToDataFrame(cur.file)
Error in matrix(vals, length(nfields), byrow = TRUE) : 
  'data' must be of a vector type, was 'NULL'

Is there some way to speed up this process?

Tags: r xml
2 Answers
三岁会撩人
2020-07-17 15:30

Consider an lapply() solution, which may speed up the file iteration. And because all the data resides in attributes, you can pull every point node's attributes in one call with XML's xpathSApply().

library(XML)

setwd("C:\\Users\\MyName\\Desktop\\XMLTest")
all.files <- list.files(pattern="\\.xml$", path=getwd(), full.names=TRUE)

dfList <- lapply(all.files, function(x){
  xml <- xmlParse(x)
  pointAttribs <- xpathSApply(doc=xml, path="//point",  xmlAttrs)
  # TRANSPOSE XPATH LIST TO DF 
  df <- data.frame(t(pointAttribs))
  # CONVERT TO NUMERIC
  df[c('time', 'lon', 'lat')] <- sapply(df[c('time', 'lon', 'lat')], 
                                        function(x) as.numeric(as.character(x)))
  return(df)
})

df <- do.call(rbind, dfList)
df
#          time       lon      lat status unit
# 1  1272686841 -122.4065 37.79778      E  id1
# 2  1272686781 -122.4054 37.79714      M  id1
# 3  1272686722 -122.4071 37.79774      M  id1
# 4  1272686661 -122.4070 37.79760      M  id1
# 5  1272686619 -122.4062 37.79698      E  id1
# ...
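With roughly 100,000 per-file data frames, the final do.call(rbind, dfList) step can itself become a bottleneck. One option, assuming the data.table package is installed, is data.table::rbindlist(), which binds long lists of data frames much faster. A minimal sketch with dummy stand-ins for dfList:

```r
library(data.table)

# dummy stand-ins for the per-file data frames produced above
dfList <- list(
  data.frame(time = 1272686841, lon = -122.40648, lat = 37.79778,
             status = "E", unit = "id1"),
  data.frame(time = 1272686781, lon = -122.40544, lat = 37.79714,
             status = "M", unit = "id1")
)

# rbindlist() is typically far faster than do.call(rbind, ...) for
# long lists; use.names = TRUE matches columns by name, which guards
# against files storing the attributes in different orders
df <- rbindlist(dfList, use.names = TRUE)
nrow(df)  # 2
```

The result is a data.table, which is also a data frame, so the rest of the workflow is unchanged.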
家丑人穷心不美
2020-07-17 15:46

Here is a solution that should work using the xml2 package. I built a function that takes a file name and extracts the five attributes you mentioned above. The comments should clarify how the script works.

library(tools)
library(xml2)
#page<-read_xml('<?xml version="1.0" encoding="UTF-8"?>
#               <car id="id1">
#               <point time="1272686841" lon="-122.40648" lat="37.79778" status="E" unit="id1"/>
#               <point time="1272686781" lon="-122.40544" lat="37.79714" status="M" unit="id1"/>
#               <point time="1272686722" lon="-122.40714" lat="37.79774" status="M" unit="id1"/>
#               <point time="1272686661" lon="-122.40704" lat="37.7976" status="M" unit="id1"/>
#               <point lon="-122.40616" time="1272686619"  lat="37.79698" status="E" unit="id1"/>
#               </car>')

readfile<-function(nextfile) {
  #read files and extract the desired nodes
  page<-read_xml(nextfile)
  nodes<-xml_find_all(page, "//point")
  #results<-xml_attrs(nodes)  #test list if all attrs are in the same order

  time<-xml_attr(nodes, "time")
  lon<-xml_attr(nodes, "lon")
  lat<-xml_attr(nodes, "lat")
  status<-xml_attr(nodes, "status")
  unit<-xml_attr(nodes, "unit")
  data.frame(time, lon, lat, status, unit)
}

#get list of files and filter out xml files
all.files <- list.files()
all.files<- all.files[tolower(file_ext(all.files)) == "xml"]
#call function, returns list of data frames then merge together
results<-lapply(all.files, readfile)
do.call(rbind, results)

Since the order of the attributes may not be the same in each file, each attribute is retrieved with its own call to xml_attr. If the order is consistent, then the xml_attrs function is a one-step solution.
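To illustrate that one-step alternative: if you have verified the attribute order is identical across files, xml_attrs() returns one named character vector per node, and a single row-bind produces the data frame. A sketch against an inline copy of the sample document (assuming consistent attribute order):

```r
library(xml2)

# inline sample standing in for one of the files
page <- read_xml('<car id="id1">
  <point time="1272686841" lon="-122.40648" lat="37.79778" status="E" unit="id1"/>
  <point time="1272686781" lon="-122.40544" lat="37.79714" status="M" unit="id1"/>
</car>')

nodes <- xml_find_all(page, "//point")

# xml_attrs() gives a list of named character vectors, one per node;
# row-bind them into a matrix, then convert to a data frame
df <- as.data.frame(do.call(rbind, xml_attrs(nodes)),
                    stringsAsFactors = FALSE)
df[c("time", "lon", "lat")] <- lapply(df[c("time", "lon", "lat")], as.numeric)
```

Note this silently misaligns columns if any file stores the attributes in a different order, which is why the per-attribute xml_attr calls above are the safer default.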
