Convert (possibly malformed) XML into a data frame

Posted 2019-05-29 07:09

Question:

I am trying to convert an XML file from the US Federal Register archive into a data frame where each row corresponds to a particular action (e.g., Notice, Rule, Proposed Rule) and each column contains an attribute of that action (e.g., Agency Type, Subject, etc.). I have tried the following:

> setwd("C:/Users/mwilliamson/Desktop/FedReg/2000/01/")
> url = "FR-2000-01-18.xml"
> doc <- xmlInternalTreeParse("FR-2000-01-18.xml")
> doc_list <- xmlToList(doc)
> library(plyr)
> j <- ldply(doc_list, data.frame)

However, it returns an error:

Error in data.frame(SECTNO = "§ 831.502", SUBJECT = "Automatic separation;  
exemption.",  : 
arguments imply differing number of rows: 1, 0
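I can reproduce the error minimally without the XML file at all: data.frame() cannot recycle a zero-length element (an empty XML child) against length-one elements in a single row.

```r
# Minimal reproduction: a zero-length child (an empty XML element) cannot
# be recycled against length-one children in a single data.frame row.
rec <- list(SECTNO = "831.502", SUBJECT = character(0))
data.frame(rec)
# Error: arguments imply differing number of rows: 1, 0
```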

It appears that the number of blank values and the differing lengths of the variables are creating an issue as R processes the XML (I may be wrong here; I don't have much experience with the XML package). I thought it might be possible to use the schema (.xsd) file to avoid this, but it is not clear how to use a schema with xmlToList. Essentially, I am looking for the "best" way to process the XML into the data frame I described and fill any blank cells with NA. I have uploaded the schema and a sample file to:

https://www.dropbox.com/sh/pluje12t185w1v2/ys1xHzilQO

Any help you can provide would be great!!

UPDATE: I have also tried:

xmlToDataFrame(doc, colClasses = character, homogeneous = NA)

but receive the following:

Error: duplicate subscripts for columns

Again, many thanks for any help you might offer.

UPDATE: It appears that the /AGENCY node is where the data begins to fit the format I am attempting to create; however, I can't extract the rest of the data (i.e., I can get a single column with 115 records identifying the agency, but not the rest of the information related to those 115 records). I have tried the following:

out <- getNodeSet(doc, "//*", fun=xmlToList)
df <- data.frame(do.call(rbind, out))
head(df)

but it seems to cause R to crash. I am hoping that my continued updates will inspire someone to lend a hand. Thanks again for any help you can give.
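I suspect "//*" is part of the problem, since it matches every element in the document and runs xmlToList on thousands of overlapping nodes; a narrower XPath stays tractable. For example, on a toy fragment (not the real file structure):

```r
library(XML)

# Toy stand-in for the real file; "//*" would visit every element here,
# while "//NOTICE/PREAMB/AGENCY" visits only the two values of interest.
frag <- "<FEDREG><NOTICES>
  <NOTICE><PREAMB><AGENCY>OPM</AGENCY></PREAMB></NOTICE>
  <NOTICE><PREAMB><AGENCY>DOE</AGENCY></PREAMB></NOTICE>
</NOTICES></FEDREG>"
doc <- xmlInternalTreeParse(frag, asText = TRUE)
xpathSApply(doc, "//NOTICE/PREAMB/AGENCY", xmlValue)
# [1] "OPM" "DOE"
```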

Answer 1:

This XML is a mess and my guess is that you'll need to parse each action separately.

table(xpathSApply(doc, "//FEDREG/child::node()", xmlName))
    DATE  NEWPART       NO  NOTICES PRESDOCS PRORULES    RULES UNITNAME      VOL 
      12        6       12        1        3        1        1       12       12 

table(xpathSApply(doc, "//NOTICES/child::node()", xmlName))
   NOTICE 
       92 

Get notices using getNodeSet

z <- getNodeSet(doc, "//NOTICE")
z[[1]]
# check node names
sapply(z, xmlSApply, xmlName)
x <- xmlToDataFrame(z)
dim(x)
[1] 92  4

So this mashes lots of details from PREAMB and SUPLINF together, and you may need to parse those nodes separately.

If you just take PREAMB, that's also a mess...

z2 <- getNodeSet(doc, "//NOTICE/PREAMB")
# check node names and notice different formats
sapply(z2, xmlSApply, xmlName)
## and count
sort( table(unlist(sapply(z2, xmlSApply, xmlName))) )
AUTH   BILCOD     NOTE GPOTABLE    STARS  PRTPAGE     DATE     FTNT      GPH  EFFDATE      ADD    DATES       FP      SIG   DEPDOC  EXTRACT      SUM 
   2        3        3        5        5        8       15       15       15       16       19       24       32       37       45       47       52 
 AGY   FURINF   SUBAGY      ACT   AGENCY  SUBJECT       HD        P 
  54       54       55       57       92       92      103      663 

I see three different formats here, so xmlToDataFrame will work with some nodes but not all.

x <- xmlToDataFrame(z2[1:4])

Compare these 10 columns to results from ldply in your code

doc_list <-  getNodeSet(doc, "//NOTICE/PREAMB", fun=xmlToList)
## this returns 31 columns since it grabs every child node...
j <- ldply(doc_list[1:4], data.frame)
names(j)
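As an aside, the differing-rows error in your original ldply call comes from records whose children have different lengths; plyr's rbind.fill pads missing columns with NA instead. A toy sketch (not the real Federal Register structure):

```r
library(XML)
library(plyr)

# Two toy records with different sets of children, like ragged NOTICE entries
frag <- "<ROOT>
  <REC><SECTNO>831.502</SECTNO><SUBJECT>Automatic separation</SUBJECT></REC>
  <REC><SECTNO>831.503</SECTNO></REC>
</ROOT>"
doc2 <- xmlInternalTreeParse(frag, asText = TRUE)
recs <- getNodeSet(doc2, "//REC", fun = xmlToList)

# as.list() guards against xmlToList returning a named character vector;
# rbind.fill() pads the missing SUBJECT in record 2 with NA
j <- rbind.fill(lapply(recs, function(r)
  as.data.frame(as.list(r), stringsAsFactors = FALSE)))
j
```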

I think it's sometimes better to just loop through the getNodeSet results and parse what you need, making sure to add NA if a node is not present (the xp function below does this). See ?getNodeSet on creating sub-documents and fixing the memory leak using free, but maybe something like this for the most common format. You could add checks and grab additional columns for Notices with lots of HD, EXTRACT and P tags.

xp <- function (doc, tag){
   n <- xpathSApply(doc, tag, xmlValue)
   if (length(n) > 0) 
      # paste multiple values?  BILCOD and probably others..
      paste0(n, collapse="; ") 
   else NA
}
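A quick check of xp on a made-up inline fragment: multiple matches are pasted with "; ", and a missing node becomes NA.

```r
library(XML)

# xp() as defined above
xp <- function(doc, tag) {
  n <- xpathSApply(doc, tag, xmlValue)
  if (length(n) > 0) paste0(n, collapse = "; ") else NA
}

d <- xmlInternalTreeParse("<NOTICE><ACT>Notice.</ACT><P>a</P><P>b</P></NOTICE>",
                          asText = TRUE)
xp(d, "//ACT")  # single match: "Notice."
xp(d, "//P")    # multiple matches pasted: "a; b"
xp(d, "//SUM")  # no match: NA
```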


  z <- getNodeSet(doc, "//NOTICE")
  n <- length(z)
  notices <- vector("list", n)
  for (i in 1:n) {
     z2 <- xmlDoc(z[[i]])
     notices[[i]] <- data.frame(
        AGENCY  = xp(z2, "//AGENCY"),
        SUBAGY  = xp(z2, "//SUBAGY"),
        SUBJECT = xp(z2, "//PREAMB/SUBJECT"),  ## SUBJECT appears in SECTION too, so be as specific as possible
        ACT     = xp(z2, "//ACT"),
        SUM     = xp(z2, "//SUM"),
        DATES   = xp(z2, "//DATES"),
        ADD     = xp(z2, "//ADD"),
        FURINF  = xp(z2, "//FURINF"),
        SIG     = xp(z2, "//PREAMB/SIG"),      ## SIG appears in SUPLINF too
        SUPLINF = xp(z2, "//SUPLINF"),
        FRDOC   = xp(z2, "//FRDOC"),
        BILCOD  = xp(z2, "//BILCOD"),
        DEPDOC  = xp(z2, "//DEPDOC"),
        PRTPAGE = xp(z2, "//PRTPAGE"),
        stringsAsFactors = FALSE)
     free(z2)
  }
  x <- do.call("rbind", notices)
  head(x)
  table(is.na(x$ACT) )
  FALSE  TRUE 
     57    35 

You still have columns like SUPLINF with lots of structured data mashed together; you could break that up if needed...

table(xpathSApply(doc, "//NOTICE/SUPLINF/child::node()", xmlName))

AMDPAR APPENDIX     AUTH   BILCOD     DATE  EXTRACT       FP     FTNT      GPH GPOTABLE       HD   LSTSUB        P  PRTPAGE      SIG     text 
     1        1       10        1        4       10       23       31       10       12      186        1      783        4       52        1 

xpathSApply(doc, "//NOTICE/SUPLINF/GPH", xmlValue)
[1] "EN18JA00.000" "EN18JA00.001" "EN18JA00.002" "EN18JA00.003" "EN18JA00.004" "EN18JA00.005" "EN18JA00.006" "EN18JA00.007" "EN18JA00.008" "EN18JA00.009"
 ## since SIG is in PREAMB and SUPLINF, you may want to parse that separately
 xpathSApply(doc, "//NOTICE/SUPLINF/SIG", xmlValue)
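If you do need SUPLINF broken out, the same xp() pattern works on its children; a sketch on a made-up inline fragment (the real file has many more child tags, per the table above):

```r
library(XML)

# xp() as defined above
xp <- function(doc, tag) {
  n <- xpathSApply(doc, tag, xmlValue)
  if (length(n) > 0) paste0(n, collapse = "; ") else NA
}

frag <- "<NOTICE><SUPLINF><HD>Background</HD><P>text one</P><P>text two</P></SUPLINF></NOTICE>"
z2 <- xmlInternalTreeParse(frag, asText = TRUE)

data.frame(
  SUPLINF_HD  = xp(z2, "//SUPLINF/HD"),
  SUPLINF_P   = xp(z2, "//SUPLINF/P"),
  SUPLINF_SIG = xp(z2, "//SUPLINF/SIG"),  # absent here, so NA
  stringsAsFactors = FALSE)
```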