I am trying to convert an XML file from the US Federal Register archive into a data frame where each row corresponds to a particular action (e.g., Notice, Rule, Proposed Rule) and each column contains an attribute related to that action (e.g., Agency Type, Subject, etc.). I have tried the following:
> setwd("C:/Users/mwilliamson/Desktop/FedReg/2000/01/")
> library(XML)
> url <- "FR-2000-01-18.xml"
> doc <- xmlInternalTreeParse(url)
> doc_list <- xmlToList(doc)
> library(plyr)
> j <- ldply(doc_list, data.frame)
However, it returns an error:
Error in data.frame(SECTNO = "§ 831.502", SUBJECT = "Automatic separation; exemption.", :
  arguments imply differing number of rows: 1, 0
It appears that the number of blank values and the differing lengths of the variables create an issue as R processes the XML (I may be wrong here; I don't have much experience with the XML package). I thought it might be possible to use the schema (.xsd) file to avoid this, but it is not clear how to use the schema with xmlToList. Essentially, I am looking for the "best" way to process the XML into the data frame I described, filling any blank cells with NA. I have uploaded the schema and a sample file to:
https://www.dropbox.com/sh/pluje12t185w1v2/ys1xHzilQO
Any help you can provide would be great!!
UPDATE: I have also tried:
xmlToDataFrame(doc, colClasses = character, homogeneous = NA)
but receive the following:
Error: duplicate subscripts for columns
Again, many thanks for any help you might offer.
UPDATE: It appears that the /AGENCY node is where the data begins to actually fit the format I am attempting to create. However, I can't seem to extract the rest of the data: I can get a single column with 115 records identifying the agency, but not the rest of the information related to those 115 records. I have tried the following:
out <- getNodeSet(doc, "//*", fun=xmlToList)
df <- data.frame(do.call(rbind, out))
head(df)
but it seems to cause R to crash. I am hoping that my continued updates will inspire someone to lend a hand. Thanks again for any help you can give.
This XML is a mess and my guess is that you'll need to parse each action separately.
table(xpathSApply(doc, "//FEDREG/child::node()", xmlName))
DATE NEWPART NO NOTICES PRESDOCS PRORULES RULES UNITNAME VOL
12 6 12 1 3 1 1 12 12
table(xpathSApply(doc, "//NOTICES/child::node()", xmlName))
NOTICE
92
Get notices using getNodeSet
z <- getNodeSet(doc, "//NOTICE")
z[[1]]
# check node names
sapply(z, xmlSApply, xmlName)
x <- xmlToDataFrame(z)
dim(x)
[1] 92 4
So this mashes together lots of detail from PREAMB and SUPLINF, and you may need to parse those nodes separately.
If you just take PREAMB, that's also a mess...
z2 <- getNodeSet(doc, "//NOTICE/PREAMB")
# check node names and notice different formats
sapply(z2, xmlSApply, xmlName)
## and count
sort( table(unlist(sapply(z2, xmlSApply, xmlName))) )
AUTH BILCOD NOTE GPOTABLE STARS PRTPAGE DATE FTNT GPH EFFDATE ADD DATES FP SIG DEPDOC EXTRACT SUM
2 3 3 5 5 8 15 15 15 16 19 24 32 37 45 47 52
AGY FURINF SUBAGY ACT AGENCY SUBJECT HD P
54 54 55 57 92 92 103 663
I see three different formats here, so xmlToDataFrame will work with some nodes but not all:
x <- xmlToDataFrame(z2[1:4])
Compare these 10 columns to results from ldply in your code
doc_list <- getNodeSet(doc, "//NOTICE/PREAMB", fun=xmlToList)
## this returns 31 columns since it grabs every child node...
j <- ldply(doc_list[1:4], data.frame)
names(j)
I think it's sometimes better to just loop through the getNodeSet results and parse what you need, making sure to add NAs when a node is not present (the xp function below does exactly that). See ?getNodeSet on creating sub documents and fixing the memory leak using free, but maybe something like this for the most common format. You could add checks and grab additional columns for Notices with lots of HD, EXTRACT, and P tags.
xp <- function(doc, tag) {
    n <- xpathSApply(doc, tag, xmlValue)
    if (length(n) > 0)
        # paste multiple values? BILCOD and probably others..
        paste0(n, collapse = "; ")
    else
        NA
}
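As a quick sanity check of the helper, here is a tiny self-contained example; the NOTICE/AGENCY/P snippet is made up for illustration, and xp is repeated so the block runs on its own:

```r
library(XML)

# the helper from above, repeated so this snippet runs standalone
xp <- function(doc, tag) {
    n <- xpathSApply(doc, tag, xmlValue)
    if (length(n) > 0) paste0(n, collapse = "; ") else NA
}

# tiny made-up document to exercise the three cases
mini <- xmlInternalTreeParse(
    "<NOTICE><AGENCY>EPA</AGENCY><P>one</P><P>two</P></NOTICE>",
    asText = TRUE)

xp(mini, "//AGENCY")   # single match: "EPA"
xp(mini, "//P")        # multiple matches collapsed: "one; two"
xp(mini, "//MISSING")  # absent node: NA
```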
z <- getNodeSet(doc, "//NOTICE")
n <- length(z)
notices <- vector("list", n)
for (i in 1:n) {
    z2 <- xmlDoc(z[[i]])
    notices[[i]] <- data.frame(
        AGENCY  = xp(z2, "//AGENCY"),
        SUBAGY  = xp(z2, "//SUBAGY"),
        SUBJECT = xp(z2, "//PREAMB/SUBJECT"),  ## SUBJECT node in SECTION too, so it helps to be as specific as possible
        ACT     = xp(z2, "//ACT"),
        SUM     = xp(z2, "//SUM"),
        DATES   = xp(z2, "//DATES"),
        ADD     = xp(z2, "//ADD"),
        FURINF  = xp(z2, "//FURINF"),
        SIG     = xp(z2, "//PREAMB/SIG"),      ## SIG in SUPLINF too
        SUPLINF = xp(z2, "//SUPLINF"),
        FRDOC   = xp(z2, "//FRDOC"),
        BILCOD  = xp(z2, "//BILCOD"),
        DEPDOC  = xp(z2, "//DEPDOC"),
        PRTPAGE = xp(z2, "//PRTPAGE"),
        stringsAsFactors = FALSE)
    free(z2)
}
x <- do.call("rbind", notices)
head(x)
table(is.na(x$ACT) )
FALSE TRUE
57 35
You still have columns like SUPLINF with lots of structured data mashed together; you could break that up if needed...
table(xpathSApply(doc, "//NOTICE/SUPLINF/child::node()", xmlName))
AMDPAR APPENDIX AUTH BILCOD DATE EXTRACT FP FTNT GPH GPOTABLE HD LSTSUB P PRTPAGE SIG text
1 1 10 1 4 10 23 31 10 12 186 1 783 4 52 1
xpathSApply(doc, "//NOTICE/SUPLINF/GPH", xmlValue)
[1] "EN18JA00.000" "EN18JA00.001" "EN18JA00.002" "EN18JA00.003" "EN18JA00.004" "EN18JA00.005" "EN18JA00.006" "EN18JA00.007" "EN18JA00.008" "EN18JA00.009"
## since SIG is in PREAMB and SUPLINF, you may want to parse that separately
xpathSApply(doc, "//NOTICE/SUPLINF/SIG", xmlValue)
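Because // searches the whole document, pulling a SUPLINF field per notice works best with the same sub-document trick as the loop above. A minimal self-contained sketch; the two NOTICEs and their values here are invented:

```r
library(XML)

# two invented NOTICEs: only the first has a SUPLINF/SIG
txt <- "<FEDREG>
  <NOTICE><PREAMB><AGENCY>EPA</AGENCY></PREAMB><SUPLINF><SIG><NAME>J. Doe</NAME></SIG></SUPLINF></NOTICE>
  <NOTICE><PREAMB><AGENCY>DOT</AGENCY></PREAMB></NOTICE>
</FEDREG>"
doc2 <- xmlInternalTreeParse(txt, asText = TRUE)

z <- getNodeSet(doc2, "//NOTICE")
sig <- sapply(z, function(node) {
    sub <- xmlDoc(node)   # sub-document, so // stays inside this NOTICE
    out <- xpathSApply(sub, "//SUPLINF/SIG", xmlValue)
    free(sub)
    if (length(out) > 0) paste0(out, collapse = "; ") else NA
})
sig   # "J. Doe" for the first notice, NA for the second
```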