I have many XML files (around 100,000) which all look like the following. Each file has around 100 point nodes. I only show five of them for illustration.
<?xml version="1.0" encoding="UTF-8"?>
<car id="id1">
<point time="1272686841" lon="-122.40648" lat="37.79778" status="E" unit="id1"/>
<point time="1272686781" lon="-122.40544" lat="37.79714" status="M" unit="id1"/>
<point time="1272686722" lon="-122.40714" lat="37.79774" status="M" unit="id1"/>
<point time="1272686661" lon="-122.40704" lat="37.7976" status="M" unit="id1"/>
<point time="1272686619" lon="-122.40616" lat="37.79698" status="E" unit="id1"/>
</car>
I want to merge all these XML files into one big data frame (about 100,000 x 100 = 10,000,000 rows) in R, with five columns (time, lon, lat, unit, status). All files have the same five variables, but they may appear in a different order.
The following is my code. I first create five vectors to store the five variables, then go through each file and read the entries one by one.
library(XML)
library(tools)  # for file_ext()

setwd("C:\\Users\\MyName\\Desktop\\XMLTest")
all.files <- list.files()
# Preallocate the five result vectors
n <- 2000000
all.lon <- rep(NA, n)
all.lat <- rep(NA, n)
all.time <- rep(NA, n)
all.status <- rep(NA, n)
all.unit <- rep(NA, n)
i <- 1
for (cur.file in all.files) {
  if (tolower(file_ext(cur.file)) == "xml") {
    xmlfile <- xmlTreeParse(cur.file)
    xmltop <- xmlRoot(xmlfile)
    # Walk the <point> nodes one by one and copy their attributes
    for (j in 1:length(xmltop)) {
      cur.node <- xmltop[[j]]
      cur.lon <- as.numeric(xmlGetAttr(cur.node, "lon"))
      cur.lat <- as.numeric(xmlGetAttr(cur.node, "lat"))
      cur.time <- as.numeric(xmlGetAttr(cur.node, "time"))
      cur.unit <- xmlGetAttr(cur.node, "unit")
      cur.status <- xmlGetAttr(cur.node, "status")
      all.lon[i] <- cur.lon
      all.lat[i] <- cur.lat
      all.time[i] <- cur.time
      all.status[i] <- cur.status
      all.unit[i] <- cur.unit
      i <- i + 1
    }
  }
}
I am new to XML, so this is the best I can do for now. The problem is that it is very slow. One reason is that there are so many files; another is the inner loop for (j in 1:length(xmltop)) that reads the entries one by one. I also tried xmlToDataFrame, but it does not work:
> xmlToDataFrame(cur.file)
Error in matrix(vals, length(nfields), byrow = TRUE) :
'data' must be of a vector type, was 'NULL'
Is there some way to speed up this process?
Consider an lapply() solution, which may speed up the file iteration. And because all of the data resides in attributes, you can retrieve each column with XML's xpathSApply() in one call.
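As a rough sketch of that approach, assuming the //point XPath and a final do.call(rbind, ...) step to stack the per-file results (both inferred from the sample file above, not taken from the answer):

library(XML)

xml.files <- list.files("C:\\Users\\MyName\\Desktop\\XMLTest",
                        pattern = "\\.xml$", ignore.case = TRUE,
                        full.names = TRUE)

# One data frame per file: each xpathSApply() call collects one attribute
# from every <point> node in a single pass.
df.list <- lapply(xml.files, function(f) {
  doc <- xmlParse(f)   # C-level parse so XPath queries work
  out <- data.frame(
    time   = as.numeric(xpathSApply(doc, "//point", xmlGetAttr, "time")),
    lon    = as.numeric(xpathSApply(doc, "//point", xmlGetAttr, "lon")),
    lat    = as.numeric(xpathSApply(doc, "//point", xmlGetAttr, "lat")),
    status = xpathSApply(doc, "//point", xmlGetAttr, "status"),
    unit   = xpathSApply(doc, "//point", xmlGetAttr, "unit"),
    stringsAsFactors = FALSE
  )
  free(doc)            # release the parsed document
  out
})

final.df <- do.call(rbind, df.list)

xmlParse() keeps the document at the C level, so xpathSApply() can visit all <point> nodes at once instead of indexing them one by one from R.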
Here is a solution that should work using the xml2 package. I built a function which takes a filename and then extracts the five attributes you mentioned above. The comments should clarify the workings of the script.
Since the attributes may not be in the same order in each file, each attribute is retrieved with its own call to xml_attr(). If the order is consistent, then a single call to xml_attrs() is a one-step solution.
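A minimal sketch of such a function, where the helper name read_car_file() and the final rbind step are placeholders of mine rather than part of the original answer:

library(xml2)

# Read one file and return a data frame of its <point> attributes.
# Each attribute gets its own xml_attr() call, so the attribute order
# inside the file does not matter.
read_car_file <- function(f) {
  doc    <- read_xml(f)
  points <- xml_find_all(doc, "//point")
  data.frame(
    time   = as.numeric(xml_attr(points, "time")),
    lon    = as.numeric(xml_attr(points, "lon")),
    lat    = as.numeric(xml_attr(points, "lat")),
    status = xml_attr(points, "status"),
    unit   = xml_attr(points, "unit"),
    stringsAsFactors = FALSE
  )
}

xml.files <- list.files("C:\\Users\\MyName\\Desktop\\XMLTest",
                        pattern = "\\.xml$", ignore.case = TRUE,
                        full.names = TRUE)
final.df <- do.call(rbind, lapply(xml.files, read_car_file))

With roughly 100,000 pieces, do.call(rbind, ...) itself can become the bottleneck; data.table::rbindlist() or dplyr::bind_rows() stack a long list of data frames much faster.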