I am parsing an XML file with getNodeSet(). Assume I have an XML file from a bookstore with 4 different books listed, but for one book the tag "authors" is missing.
If I parse the XML for the tag "authors" using data.nodes.2 <- getNodeSet(data,'//*/authors'), R returns a list of 3 elements.
However, this is not exactly what I want. How do I get getNodeSet() to return a list with 4 elements instead of 3, i.e. one element holding a missing value where the tag "authors" does not exist?
I appreciate any help.
library(XML)
file <- "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\r\n<!-- Edited by XMLSpy® -->\r\n<bookstore>\r\n<book category=\"cooking\">\r\n<title lang=\"en\">Everyday Italian</title>\r\n<authors>\r\n<author>Giada De Laurentiis</author>\r\n</authors>\r\n<year>2005</year>\r\n<price>30.00</price>\r\n</book>\r\n<book category=\"children\">\r\n<title lang=\"en\">Harry Potter</title>\r\n<authors>\r\n<author>J K. Rowling</author>\r\n</authors>\r\n<year>2005</year>\r\n<price>29.99</price>\r\n</book>\r\n<book category=\"web\">\r\n<title lang=\"en\">XQuery Kick Start</title>\r\n<authors>\r\n<author>James McGovern</author>\r\n<author>Per Bothner</author>\r\n<author>Kurt Cagle</author>\r\n<author>James Linn</author>\r\n<author>Vaidyanathan Nagarajan</author>\r\n</authors>\r\n<year>2003</year>\r\n<price>49.99</price>\r\n</book>\r\n<book category=\"web\" cover=\"paperback\">\r\n<title lang=\"en\">Learning XML</title>\r\n\r\n<year>2003</year>\r\n<price>39.95</price>\r\n</book>\r\n</bookstore>"
data <- xmlParse(file)
data.nodes.1 <- getNodeSet(data,'//*/book')
data.nodes.2 <- getNodeSet(data,'//*/authors')
# Data
# <?xml version="1.0" encoding="ISO-8859-1"?>
# <!-- Edited by XMLSpy® -->
# <bookstore>
# <book category="cooking">
# <title lang="en">Everyday Italian</title>
# <authors>
# <author>Giada De Laurentiis</author>
# </authors>
# <year>2005</year>
# <price>30.00</price>
# </book>
# <book category="children">
# <title lang="en">Harry Potter</title>
# <authors>
# <author>J K. Rowling</author>
# </authors>
# <year>2005</year>
# <price>29.99</price>
# </book>
# <book category="web">
# <title lang="en">XQuery Kick Start</title>
# <authors>
# <author>James McGovern</author>
# <author>Per Bothner</author>
# <author>Kurt Cagle</author>
# <author>James Linn</author>
# <author>Vaidyanathan Nagarajan</author>
# </authors>
# <year>2003</year>
# <price>49.99</price>
# </book>
# <book category="web" cover="paperback">
# <title lang="en">Learning XML</title>
# <year>2003</year>
# <price>39.95</price>
# </book>
# </bookstore>
You can also try xmlToDataFrame, which works for some XML.
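A minimal sketch of the xmlToDataFrame idea, assuming a trimmed copy of the bookstore sample (the name xml_text and the shortened records are my own, not the asker's data). Each book becomes one row, and a book without an "authors" tag should get NA in that column.

```r
library(XML)

# Trimmed, hypothetical copy of the bookstore sample; the last book has no <authors> tag.
xml_text <- paste0(
  "<bookstore>",
  "<book><title>Everyday Italian</title>",
  "<authors><author>Giada De Laurentiis</author></authors><year>2005</year></book>",
  "<book><title>Harry Potter</title>",
  "<authors><author>J K. Rowling</author></authors><year>2005</year></book>",
  "<book><title>XQuery Kick Start</title>",
  "<authors><author>James McGovern</author><author>Per Bothner</author></authors>",
  "<year>2003</year></book>",
  "<book><title>Learning XML</title><year>2003</year></book>",
  "</bookstore>"
)

doc <- xmlParse(xml_text, asText = TRUE)

# One row per <book>; child tags that a book lacks are filled with NA.
books_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
print(books_df)
```

Note that a book with several authors ends up with all names concatenated in a single cell, since the text of the whole "authors" node is used.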
If you don't like the authors mashed together, you can sometimes fix that with pattern matching
Other options are to loop through the book nodes (see ?getNodeSet for how to create and free subnodes) or to follow Martin's answer (and if you want 4 rows instead, try this)
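The original snippet is not shown here, so the following is only a sketch of looping over the book nodes, assuming a trimmed sample (xml_text is a hypothetical name). Each book is copied into its own small document with xmlDoc(), queried, and freed again, which gives one entry per book with authors collapsed into a single string.

```r
library(XML)

# Trimmed, hypothetical copy of the bookstore sample; the last book has no <authors> tag.
xml_text <- paste0(
  "<bookstore>",
  "<book><title>Everyday Italian</title>",
  "<authors><author>Giada De Laurentiis</author></authors><year>2005</year></book>",
  "<book><title>Harry Potter</title>",
  "<authors><author>J K. Rowling</author></authors><year>2005</year></book>",
  "<book><title>XQuery Kick Start</title>",
  "<authors><author>James McGovern</author><author>Per Bothner</author></authors>",
  "<year>2003</year></book>",
  "<book><title>Learning XML</title><year>2003</year></book>",
  "</bookstore>"
)

doc <- xmlParse(xml_text, asText = TRUE)
books <- getNodeSet(doc, "//book")

authors <- vapply(books, function(book) {
  sub_doc <- xmlDoc(book)      # copy this <book> into a stand-alone document
  on.exit(free(sub_doc))       # release the copy when the function returns
  found <- xpathSApply(sub_doc, "//author", xmlValue)
  # No <author> nodes: return NA so the book still occupies a slot.
  if (length(found) == 0) NA_character_ else paste(found, collapse = "; ")
}, character(1))

print(authors)
```

The result has the same length as the list of book nodes, so it can be placed next to other per-book values.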
One option is to use R's list processing to extract the authors from each book node and then munge that with book-level info.
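The code from this answer is not preserved, so here is a hedged sketch of the list-processing idea on a trimmed sample (xml_text, authors, titles and long_df are names chosen for illustration). Authors are extracted per book node, an NA stands in where none exist, and the titles are repeated to match.

```r
library(XML)

# Trimmed, hypothetical copy of the bookstore sample; the last book has no <authors> tag.
xml_text <- paste0(
  "<bookstore>",
  "<book><title>Everyday Italian</title>",
  "<authors><author>Giada De Laurentiis</author></authors><year>2005</year></book>",
  "<book><title>Harry Potter</title>",
  "<authors><author>J K. Rowling</author></authors><year>2005</year></book>",
  "<book><title>XQuery Kick Start</title>",
  "<authors><author>James McGovern</author><author>Per Bothner</author></authors>",
  "<year>2003</year></book>",
  "<book><title>Learning XML</title><year>2003</year></book>",
  "</bookstore>"
)

doc <- xmlParse(xml_text, asText = TRUE)
books <- getNodeSet(doc, "//book")

# One list element per book: a character vector of authors, or NA if there are none.
authors <- lapply(books, function(book) {
  found <- xpathSApply(book, "./authors/author", xmlValue)
  if (length(found) == 0) NA_character_ else found
})

# Book-level info, here just the title.
titles <- vapply(books, function(book) xmlValue(book[["title"]]), character(1))

# Repeat each title once per author so the two columns line up.
long_df <- data.frame(
  title  = rep(titles, lengths(authors)),
  author = unlist(authors),
  stringsAsFactors = FALSE
)
print(long_df)
```

This yields one row per author, plus a single row with a missing author for the book that has no "authors" tag.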
Here is an xml2 approach.
The code is very readable, and therefore easy to maintain.
code
output
sample data
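The answer's own code and sample data did not survive, so this is only a sketch of how an xml2 version could look, assuming a trimmed sample (xml_text is a hypothetical name). It relies on xml_find_first() returning a missing node for books without a match, which xml_text() turns into NA.

```r
library(xml2)

# Trimmed, hypothetical copy of the bookstore sample; the last book has no <authors> tag.
xml_text <- paste0(
  "<bookstore>",
  "<book><title>Everyday Italian</title>",
  "<authors><author>Giada De Laurentiis</author></authors><year>2005</year></book>",
  "<book><title>Harry Potter</title>",
  "<authors><author>J K. Rowling</author></authors><year>2005</year></book>",
  "<book><title>XQuery Kick Start</title>",
  "<authors><author>James McGovern</author><author>Per Bothner</author></authors>",
  "<year>2003</year></book>",
  "<book><title>Learning XML</title><year>2003</year></book>",
  "</bookstore>"
)

doc <- read_xml(xml_text)
books <- xml_find_all(doc, "//book")

# xml_find_first() keeps one result per book, using a missing node where
# no <authors> child exists, so the output stays aligned with `books`.
authors <- xml_text(xml_find_first(books, "./authors"))
titles  <- xml_text(xml_find_first(books, "./title"))

result <- data.frame(title = titles, authors = authors, stringsAsFactors = FALSE)
print(result)
```

As with xmlToDataFrame, multiple authors of one book are concatenated here; querying "./authors/author" per book would keep them separate.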
You can use the plyr library.
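The answer's code is missing, so the following is an assumption about how plyr could be applied rather than the original solution (xml_text and books_df are hypothetical names). ldply() builds one small data frame per book and binds them into a single result.

```r
library(XML)
library(plyr)

# Trimmed, hypothetical copy of the bookstore sample; the last book has no <authors> tag.
xml_text <- paste0(
  "<bookstore>",
  "<book><title>Everyday Italian</title>",
  "<authors><author>Giada De Laurentiis</author></authors><year>2005</year></book>",
  "<book><title>Harry Potter</title>",
  "<authors><author>J K. Rowling</author></authors><year>2005</year></book>",
  "<book><title>XQuery Kick Start</title>",
  "<authors><author>James McGovern</author><author>Per Bothner</author></authors>",
  "<year>2003</year></book>",
  "<book><title>Learning XML</title><year>2003</year></book>",
  "</bookstore>"
)

doc <- xmlParse(xml_text, asText = TRUE)
books <- getNodeSet(doc, "//book")

# Build a one-row data frame per book; ldply() stacks them.
books_df <- ldply(seq_along(books), function(i) {
  book <- books[[i]]
  found <- xpathSApply(book, "./authors/author", xmlValue)
  data.frame(
    title   = xmlValue(book[["title"]]),
    # NA keeps a row for books that have no <authors> tag.
    authors = if (length(found) == 0) NA_character_ else paste(found, collapse = "; "),
    stringsAsFactors = FALSE
  )
})
print(books_df)
```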