R-XML pulling nodes into matrix/DF accounting for

2019-05-26 18:09发布

问题:

I am fairly new to using R and very new to using the XML package and xpath. I need to pull four elements from an xml file that looks like this (except that I have trimmed off a lot of other xmlnodes to simplify it here):

<?xml version="1.0" encoding="utf-8"?>
<iati-activities version="1.03" generated-datetime="2015-07-07T16:49:09+00:00">
  <iati-activity last-updated-datetime="2014-08-11T14:36:59+00:00" xml:lang="en" default-currency="EUR">
<iati-identifier>NL-KVK-41160054-100530</iati-identifier>
<title>Improvement of basic health care</title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<participating-org role="Funding" ref="EU" type="15">EU</participating-org>
<participating-org role="Funding" type="21">Cordaid Memisa</participating-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Implementing" type="21">CORDAID RCA</participating-org>
<recipient-country percentage="100" code="CF">CENTRAL AFRICAN REPUBLIC</recipient-country>
<budget type="1">
  <period-start iso-date="2010-01-01"></period-start>
  <period-end iso-date="2013-02-28"></period-end>
</budget>
  </iati-activity>
  <iati-activity last-updated-datetime="2013-07-19T14:12:14+00:00" xml:lang="en" default-currency="EUR">
<iati-identifier>NL-KVK-41160054-100625</iati-identifier>
<title>Pigs for Pencils</title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Funding" type="60">Stichting Kapatiran</participating-org>
<participating-org role="Implementing" type="22">PREDA Foundation Inc.</participating-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<budget type="2">
  <period-start iso-date="2010-04-20"></period-start>
  <period-end iso-date="2012-10-02"></period-end>
  <value value-date="2010-04-20">12500</value>
</budget>
   </iati-activity>
  <iati-activity last-updated-datetime="2015-04-08T03:01:58+00:00" xml:lang="en" default-currency="EUR">
    <iati-identifier>NL-KVK-41160054-100815</iati-identifier>
<title>Job and housing opportunities for women </title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Implementing" type="22">WISE</participating-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<budget type="2">
  <period-start iso-date="2010-10-01"></period-start>
  <period-end iso-date="2011-12-31"></period-end>
  <value value-date="2010-10-01">227000</value>
</budget>
  </iati-activity>
</iati-activities>

Also this is my first question ever on StackOverflow, so apologies if I'm not doing it correctly (and bc that xml is not perfectly aligned). The elements I need, and what I'm assigning them to are:

UniqueID <- "//iati-activity/iati-identifier"

GrantTitle <- "//iati-activity/title"

GrantAmount <- "//iati-activity/budget/value"

Recipient <- "//iati-activity/participatingorg[@role='Implementing']"

So far (after much trial and tribulation) I have come up with this code, that goes through the current node (x), pulling the 4 variables, and cbinding them into a row, then using xpathApply to loop through iati-activity nodes calling the function and rbinding the resulting rows together.

This code works when all four elements exist in each activity. However, note the absence of the budget/value node from the xml sample. This is because I removed it in order to solve this problem of missing nodes, which occur frequently in the full file for almost all the elements I need.

Also note the [1] at the end of my xpath expressions- I've included these because there are also multiple titles, multiple participating-orgs of all types, etc.

Given the multiples of some elements and the nonexistence of others, it makes it impossible to simple pull all the same elements into a vector and pop it into a data frame. Thus the need to loop through each activity pulling the elements with it. My code currently doesn't work to account for missing elements (the missing budget/value in the first iati-activity) because cbinding (and rbinding) ignore null vectors.

xmltestNA = xmlInternalTreeParse("XMLtoDF_TestNA.xml", useInternalNodes=TRUE)
bodyToDF <- function(x){
  UniqueID <- xpathSApply(x, "./iati-identifier", xmlValue)
  GrantTitle <- xpathSApply(x, "./title[1]", xmlValue)
  GrantAmount <- xpathSApply(x, "./budget/value[1]", xmlValue)
  Recipient <- xpathSApply(x, "./participating-org[@role='Implementing'][1]", xmlValue)
  cbind(UniqueID=UniqueID, GrantTitle=GrantTitle, GrantAmount=GrantAmount, Recipient=Recipient)
  }
res <-xpathApply(xmltestNA, '//iati-activity', fun=bodyToDF)
IatiNA <-do.call(rbind, res)
IatiNA

How can I keep the null values/missing nodes in order to turn it into a matrix or dataframe that looks like this:

    UniqueID    GrantTitle  GrantAmount Recipient
1   NL-KVK-41160054-100530  Improvement of basic health care    NA  CORDAID RCA
2   NL-KVK-41160054-100625  Pigs for Pencils    12500   PREDA Foundation Inc.
3   NL-KVK-41160054-100815  Job and housing opportunities for women     227000  WISE

Because I'm still new, the simpler the code, the better. Thanks in advance!

回答1:

If your xpath queries return too many or few results, I think it's easier to work with the nodes

doc <- xmlParse( '<your xml here>')
nodes<- getNodeSet(doc, "//iati-activity")

#Compare
xpathSApply(doc, "//budget/value", xmlValue)
xpathSApply(doc, "//participating-org[@role='Funding']", xmlValue)

sapply(nodes, function(x) xpathSApply(x, "./budget/value", xmlValue))
sapply(nodes, function(x) xpathSApply(x, "./participating-org[@role='Funding']", xmlValue))

Add a function to handle missing or multiple nodes and then create the data.frame

xpath2 <-function(x, path, fun = xmlValue, ...){
   y <- xpathSApply(x, path, fun, ...)
   ifelse(length(y) == 0, NA,
    ifelse(length(y) > 1, paste(unlist(y), collapse=", "), y))
}

GrantAmount <- sapply(nodes, xpath2, "./budget/value")
UniqueID    <- sapply(nodes, xpath2, "./iati-identifier")
GrantTitle  <- sapply(nodes, xpath2, "./title")
Recipient   <-  sapply(nodes, xpath2, "./participating-org[@role='Implementing']")
## updated xpath2 so xmlGetAttr will also work
Funding_ref  <- sapply(nodes, xpath2, "./participating-org[@role='Funding']", xmlGetAttr, "ref")
Budget_start <- sapply(nodes, xpath2, ".//period-start", xmlGetAttr, "iso-date")

data.frame(UniqueID, GrantTitle, GrantAmount, Recipient)
                UniqueID                               GrantTitle GrantAmount             Recipient
1 NL-KVK-41160054-100530         Improvement of basic health care        <NA>           CORDAID RCA
2 NL-KVK-41160054-100625                         Pigs for Pencils       12500 PREDA Foundation Inc.
3 NL-KVK-41160054-100815 Job and housing opportunities for women       227000                  WISE


回答2:

Consider using an XSLT transformation file. As information, XSLT is a declarative special-purpose programming language used to transform, style, or re-format XML files in various structures. Sadly this forgotten language in the web age is still very useful for every day needs.

Specifically for you, you need a simpler structure for R to easily import. The one caveat is R does not have a dedicated, reliable XSLT library. Therefore, you need to find an external XSLT processor. Fortunately, several open-source solutions are available which R can leverage.

XSLT Stylesheet

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" indent="yes" omit-xml-declaration="no" />

<xsl:template match="iati-activities">
<GrantData>

    <xsl:for-each select="iati-activity">
    <iati-activity>
        <UniqueID><xsl:value-of select="iati-identifier"/></UniqueID>
        <GrantTitle><xsl:value-of select="title"/></GrantTitle>
        <GrantAmount><xsl:value-of select="budget/value"/></GrantAmount>
        <Recipient><xsl:value-of select="participating-org[@role='Implementing']"/></Recipient>
    </iati-activity>
    </xsl:for-each>

</GrantData>

</xsl:template>

</xsl:stylesheet>

Python (using its lxml library)

#!/usr/bin/python
import lxml.etree as ET

dom = ET.parse('C:\Path\To\RawXMLFile.xml')
xslt = ET.parse('C:\Path\To\RawXSLTFile.xsl')
transform = ET.XSLT(xslt)
newdom = transform(dom)

tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True,  xml_declaration=True)
print(tree_out)

xmlfile = open('C:\Path\To\OutputXMLFile.xml','ab')
xmlfile.write(tree_out)
xmlfile.close()

PHP (using its xsl extension)

This is for those who think PHP is not a general-purpose language but only restricted to the web:

<?php

// Load the XML source and XSLT file
$cd = dirname(__FILE__);

$xml = new DOMDocument;
$xml->load('C:\Path\To\RawXMLFile.xml');
$xsl = new DOMDocument;
$xsl->load('C:\Path\To\RawXSLTFile.xsl');    

// Configure the transformer
$proc = new XSLTProcessor;
$proc->importStyleSheet($xsl); // attach the xsl rules    

// Transform XML source
$newXml = $proc->transformToXML($xml);

// Save output to file
$xmlfile = 'C:\Path\To\OutputXMLFile.xml.xml';
file_put_contents($xmlfile, $newXml);  

?>

Both of which produces XML with a simpler structure for your R data frame needs (notice the empty node for GrantAmount).

<?xml version="1.0"?>

<GrantData>
  <iati-activity>
    <UniqueID>NL-KVK-41160054-100530</UniqueID>
    <GrantTitle>Improvement of basic health care</GrantTitle>
    <GrantAmount/>
    <Recipient>CORDAID RCA</Recipient>
  </iati-activity>
  <iati-activity>
    <UniqueID>NL-KVK-41160054-100625</UniqueID>
    <GrantTitle>Pigs for Pencils</GrantTitle>
    <GrantAmount>12500</GrantAmount>
    <Recipient>PREDA Foundation Inc.</Recipient>
  </iati-activity>
  <iati-activity>
    <UniqueID>NL-KVK-41160054-100815</UniqueID>
    <GrantTitle>Job and housing opportunities for women </GrantTitle>
    <GrantAmount>227000</GrantAmount>
    <Recipient>WISE</Recipient>
  </iati-activity>
</GrantData>

Once the simpler XML is generated, simply import into R:

library(XML)

# LOADING TRANSFORMED XML INTO R DATA FRAME
doc<-xmlParse("C:\\Path\\To\\OutputXMLFile.xml")
xmldf <- xmlToDataFrame(nodes = getNodeSet(doc, "//iati-activity"))

VBA (using the MSXML object)

Finally, there is a VBA solution to be run in MS Excel or MS Access. Fortunately, for this last solution you can connect the VBA with R using the R's RDCOMClient package (only a PC solution):

library(XML)
library(RDCOMClient)

xmlfile = COMCreate("MSXML2.DOMDocument")
xslfile = COMCreate("MSXML2.DOMDocument")
newxmlfile = COMCreate("MSXML2.DOMDocument")

xmlstr = "C:\\Path\\To\\RawXMLFile.xml"
xslstr = "C:\\Path\\To\\RawXSLTFile.xsl"
newxmlstr = "C:\\Path\\To\\OutputXMLFile.xml"

# LOADING XML & XSLT FILES
xmlfile.async = FALSE
xmlfile$Load(xmlstr)

xslfile.async = FALSE
xslfile$Load(xslstr)

# TRANSFORMING XML FILE USING XLST INTO NEW FILE
xmlfile$transformNodeToObject(xslfile, newxmlfile)
newxmlfile$Save(newxmlstr)

# LOADING TRANSFORMED XML INTO R DATA FRAME
doc<-xmlParse("C:\\Path\\To\\OutputXMLFile.xml")
xmldf <- xmlToDataFrame(nodes = getNodeSet(doc, "//iati-activity"))
View(xmldf)

xmlfile <- NULL
xslfile <- NULL
newxmlfile <- NULL

Data frame results with missing node as NaN:

    UniqueID                GrantTitle                          GrantAmount    Recipient
1   NL-KVK-41160054-100530  Improvement of basic health care                   CORDAID RCA
2   NL-KVK-41160054-100625  Pigs for Pencils                           12500   PREDA Foundation Inc.
3   NL-KVK-41160054-100815  Job and housing opportunities for women   227000   WISE