Import all fields (and subfields) of XML as datafr

2019-04-09 10:52发布

问题:

To do some analysis I want to import a XML to a dataframe using R and the XML package. Example of XML file:

<watchers shop_name="TEST" created_at="September 14, 2012 05:44">
<watcher channel="Site Name">
    <code>123456</code>
    <search_key>TestKey</search_key>
    <date>September 14, 2012 04:15</date>
    <result>Found</result>
    <link>http://www.test.com/fakeurl</link>
    <price>100.0</price>
    <shipping>0.0</shipping>
    <origposition>0</origposition>
    <name>Name Test</name>
    <results>
        <result position="1">
            <c_name>CTest1</c_name>
            <c_price>599.49</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>599.49</c_total_price>
            <c_rating>8.3</c_rating>
            <c_delivery/>
        </result><result position="2">
            <c_name>CTest2</c_name>
            <c_price>654.0</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>654.0</c_total_price>
            <c_rating>9.8</c_rating>
            <c_delivery/>
        </result>
        <result position="3">
            <c_name>CTest3</c_name>
            <c_price>654.0</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>654.0</c_total_price>
            <c_rating>8.8</c_rating>
            <c_delivery/>
        </result>
    </results>
</watcher>
</watchers>

I want to have the rows of the dataframe containing the following fields:

shop_name   created_at  code    search_key  date    result
link    price   shipping    origposition    name    
position    c_name  c_price c_shipping  c_total_price   
c_rating    c_delivery

This means that the child nodes must be taken into account as well, which would result in a dataframe of three rows in this example (since the results show 3 positions). The fields

shop_name   created_at  code    search_key
date    result  link    price   shipping    
origposition    name

are the same for each of these rows.

I am able to go through the XML file, but I am unable to get a dataframe with the fields i want. When I convert the dataframe to a dataframe I get the following fields:

"code"       "search_key"      "date"     "result"  
"link" "price"      "shipping"   "origposition"  
"name"    "results"     

Here the fields

shop_name   created_at

are missing at the beginning and the 'results' are put together in a String under the column "results".

It must be possible to get the wanted dataframe, but I do not know how to do this exactly.

UPDATE

The solution provided by @MvG works brilliantly on the test XML file stated above. However the column 'result' can also have the value "Not Found". Entries with this value will miss certain fields (always the same filed) and therefore yield a "number of columns of arguments do not match"-error when running the solution. I would like these entries to be put in the dataframe as well, with the fields that are not present left empty. I do not understand how to incorporate this scenario.

test.xml

<watchers shop_name="TEST" created_at="September 14, 2012 05:44">
<watcher channel="Site Name">
    <code>123456</code>
    <search_key>TestKey</search_key>
    <date>September 14, 2012 04:15</date>
    <result>Found</result>
    <link>http://www.test.com/fakeurl</link>
    <price>100.0</price>
    <shipping>0.0</shipping>
    <origposition>0</origposition>
    <name>Name Test</name>
    <results>
        <result position="1">
            <c_name>CTest1</c_name>
            <c_price>599.49</c_price>
            <c_shipping>0.0</c_shipping>
            <c_total_price>599.49</c_total_price>
            <c_rating>8.3</c_rating>
            <c_delivery/>
        </result><result position="2">
            <c_name>CTest2</c_name>
            <c_price>654.0</c_price>
            <c_shipping>0.0</c_shipping>
        <c_total_price>654.0</c_total_price>
        <c_rating>9.8</c_rating>
        <c_delivery/>
    </result>
    <result position="3">
        <c_name>CTest3</c_name>
        <c_price>654.0</c_price>
        <c_shipping>0.0</c_shipping>
        <c_total_price>654.0</c_total_price>
        <c_rating>8.8</c_rating>
        <c_delivery/>
    </result>
</results>
</watcher>
<watcher channel="Shopping">
    <code>12804</code>
    <search_key></search_key>
    <date></date>
    <result>Not found</result>
    <link>https://www.test.com/testing1323p</link>
    <price>0.0</price>
    <shipping>0.0</shipping>
    <origposition>0</origposition>
    <name>MOOVM6002020</name>
    <results>
    </results>
</watcher>
</watchers>

回答1:

Here is a more generic approach. Every node is classified as one of three cases:

  • If the node name is of kind rows, then the data frames from child nodes will result in different rows of the result.
  • If the node name is of kind cols, then the data frames from child nodes will result in different columns of the result.
  • If the node name is of kind value, then a data frame with a single value will be constructed, using the node name as the column name and the node value as the column value.
  • For all three cases, attributes of the node will be added to the data frame.

The call for your application is given towards the bottom.

library(XML)

zeroColSingleRow <- function() {
  res <- data.frame(dummy=NA)
  res$dummy <- NULL
  stopifnot(nrow(res) == 1, ncol(res) == 0)
  return (res)
}

xml2df <- function(node, classifier) {
  if (! inherits(node, c("XMLInternalElementNode", "XMLElementNode"))) {
    return (zeroColSingleRow())
  }
  kind <- classifier(node)
  if (kind == "rows") {
    cdf <- lapply(xmlChildren(node), xml2df, classifier)
    if (length(cdf) == 0) {
      res <- zeroColSingleRow()
    }
    else {
      names <- unique(unlist(lapply(cdf, colnames)))
      cdf <- lapply(cdf, function(i) {
        missing <- setdiff(names, colnames(i))
        if (length(missing) > 0) {
          i[missing] <- NA
        }
        return (i)
      })
      res <- do.call(rbind, cdf)
    }
  }
  else if (kind == "cols") {
    cdf <- lapply(xmlChildren(node), xml2df, classifier)
    if (length(cdf) == 0) {
      res <- zeroColSingleRow()
    }
    else {
      res <- cdf[[1]]
      if (length(cdf) > 1) {
        for (i in 2:length(cdf)) {
          res <- merge(res, cdf[[i]], by=NULL)
        }
      }
    }
  }
  else {
    stopifnot(kind == "value")
    res <- data.frame(xmlValue(node))
    names(res) <- xmlName(node)
  }
  if (ncol(res) == 0) {
    res <- zeroColSingleRow()
  }
  attr <- xmlAttrs(node)
  if (length(attr) > 0) {
    attr <- do.call(data.frame, as.list(attr))
    res <- merge(attr, res, by=NULL)
  }
  rownames(res) <- NULL
  return(res)
}

doc<-xmlParse("test.xml")

xml2df(xmlRoot(doc), function(node) {
  name <- xmlName(node)
  if (name %in% c("watchers", "results"))
    return("rows")
  # make sure to treat results/result different from watcher/result
  if (name %in% c("watcher", "result") &&
      xmlName(xmlParent(node)) == paste0(name, "s"))
    return("cols")
  return("value")
})


回答2:

Here is one possibility:

attr2df <- function(n) do.call(data.frame, as.list(xmlAttrs(n)))
cbind(attr2df(xmlRoot(doc)), 
  do.call(rbind, xpathApply(doc, "//watcher", function(w) {
    x <- xmlToDataFrame(nodes = list(w))
    x$results<-NULL
    cbind(attr2df(w), x,
          xmlToDataFrame(nodes = getNodeSet(w, "results/result")))
  } ))
)

Iterate over all watchers. For each watcher, read its subtree to data frame x, and read its result nodes to another data frame. Remove the results from the first data frame, then bind the columns of both together, and throw in the attributes from the watcher as well. This application will yield one data.frame per watcher, and the outer rbind cal will combine them to a single data frame. The outermost cbind will add the attributes of the root node.

The result will have these names:

 [1] "shop_name"     "created_at"    "channel"       "code"         
 [5] "search_key"    "date"          "result"        "link"         
 [9] "price"         "shipping"      "position"      "name"         
[13] "c_name"        "c_price"       "c_shipping"    "c_total_price"
[17] "c_rating"      "c_delivery"   


标签: xml r import