I am new to XML. I downloaded an XML file called ipg140722.xml from the Google USPTO bulk-download page, http://www.google.com/googlebooks/uspto-patents-grants-text.html. I am using Windows 8.1 and R 3.1.1:
library(XML)
url <- "E:\\clouddownload\\R-download\\ipg140722.xml"
indata <- xmlTreeParse(url)
error: 1: XML declaration allowed only at the start of the document
2: Extra content at the end of the document
What is the problem?
Note: This post is edited from the original version.
The object lesson here is that just because a file has an .xml extension does not mean it is well-formed XML.
If @MartinMorgan is correct about the file, Google seems to have taken all the patents approved during the week of 2014-07-22 (last week), converted them to XML, strung them together into a single text file, and given that file an .xml extension. Clearly this is not well-formed XML. So the challenge is to deconstruct that file. Here is a way to do it in R.
lines <- readLines("ipg140722.xml")
# each concatenated document begins with its own XML declaration
start <- grep('<?xml version="1.0" encoding="UTF-8"?>', lines, fixed = TRUE)
# a document ends on the line before the next declaration (or at end of file)
end <- c(start[-1] - 1, length(lines))
library(XML)
get.xml <- function(i) {
  txt <- paste(lines[start[i]:end[i]], collapse = "\n")
  # print(i)
  xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
  # return(i)
}
docs <- lapply(1:10, get.xml)
class(docs[[1]])
# [1] "XMLInternalDocument" "XMLAbstractDocument"
So now docs is a list of parsed XML documents. These can be accessed individually (e.g., docs[[1]]) or collectively using something like the code below, which extracts the invention title from each document.
sapply(docs,function(doc) xmlValue(doc["//invention-title"][[1]]))
# [1] "Phallus retention harness" "Dress/coat"
# [3] "Shirt" "Shirt"
# [5] "Sandal" "Shoe"
# [7] "Footwear" "Flexible athletic shoe sole"
# [9] "Shoe outsole with a surface ornamentation contrast" "Shoe sole"
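If the bracket indexing on an internal document (doc["//invention-title"]) looks unfamiliar, xpathSApply from the same XML package gives an equivalent result; here it is on a tiny stand-in document rather than the patent file:

```r
library(XML)

# equivalent extraction with xpathSApply instead of doc["//invention-title"];
# the one-element document here is just for illustration
doc <- xmlTreeParse("<doc><invention-title>Shoe</invention-title></doc>",
                    asText = TRUE, useInternalNodes = TRUE)
xpathSApply(doc, "//invention-title", xmlValue)
# [1] "Shoe"
```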
And no, I did not make up the name of the first patent.
Response to OP's comment
My original post detected the start of a new document using:
start <- grep("xml version", lines, fixed = TRUE)
This was too naive: it turns out the phrase "xml version" appears in the text of some of the patents, so it was splitting (some of) the documents prematurely, resulting in malformed XML. The code above fixes that problem by matching the full XML declaration. If you uncomment the two lines in get.xml(...) and run the code above with
docs <- lapply(1:length(start),get.xml)
you will see that all 6961 documents parse correctly.
But there is another problem: the parsed XML is very large, so if you leave those lines as comments and try to parse the full set, you run out of memory about halfway through (at least I did, on an 8GB system). There are two ways to work around this. The first is to do the parsing in blocks (say, 2000 documents at a time). The second is to extract whatever information you need for your CSV file inside get.xml(...) and discard the parsed document at each step.
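A sketch of that second approach (the name get.fields is my own; it assumes the lines, start, and end objects from the code above, and you would extend the xpathSApply calls to whatever fields your CSV needs):

```r
library(XML)

# Hypothetical variant of get.xml: parse one document, keep only the
# fields needed for the CSV, and free the parsed tree immediately so
# memory use stays flat across all the documents.
get.fields <- function(i) {
  txt <- paste(lines[start[i]:end[i]], collapse = "\n")
  doc <- xmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)
  title <- xpathSApply(doc, "//invention-title", xmlValue)[1]
  free(doc)  # release the C-level document explicitly
  data.frame(title = title, stringsAsFactors = FALSE)
}

# result <- do.call(rbind, lapply(seq_along(start), get.fields))
# write.csv(result, "patents.csv", row.names = FALSE)
```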