I am trying to extract information from an XML file from ClinicalTrials.gov. The file is organized in the following way:
<clinical_study>
...
<brief_title>
...
<location>
<facility>
<name>
<address>
<city>
<state>
<zip>
<country>
</facility>
<status>
<contact>
<last_name>
<phone>
<email>
</contact>
</location>
<location>
...
</location>
...
</clinical_study>
I can use the R XML package from CRAN in the following code to extract all location nodes from the XML file:
library(XML)
clinicalTrialUrl <- "http://clinicaltrials.gov/ct2/show/NCT01480479?resultsxml=true"
xmlDoc <- xmlParse(clinicalTrialUrl, useInternalNodes=TRUE)
locations <- xmlToDataFrame(getNodeSet(xmlDoc,"//location"))
This mostly works. However, if you look at the data frame, you will notice that the xmlToDataFrame function lumps everything under <facility>
into a single concatenated string. One solution would be to write code that generates the data frame column by column.
You could flatten the XML first.
This answer converts the XML to a list, unlists each location section, transposes the section, converts the section to a data.table, and then uses rbindlist to merge all of the individual locations into one table. The fill=T argument matches the elements by name, and fills in missing element values with NA.
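The steps described above might look roughly like the sketch below. It is not the original answer's code: the inline XML is a made-up sample shaped like the question's file (with city, state, zip, and country assumed to be nested inside <address>), and the names xmlText, xmlList, and locationSections are chosen here for illustration. Parsing from a string keeps the example independent of the ClinicalTrials.gov URL.

```r
library(XML)
library(data.table)

# Made-up sample shaped like the question's file (not real trial data).
# The second location deliberately omits state, zip, and contact.
xmlText <- '<clinical_study>
  <brief_title>Example study</brief_title>
  <location>
    <facility>
      <name>Site A</name>
      <address>
        <city>Boston</city>
        <state>Massachusetts</state>
        <zip>02115</zip>
        <country>United States</country>
      </address>
    </facility>
    <status>Recruiting</status>
    <contact>
      <last_name>Jane Doe</last_name>
      <phone>555-0100</phone>
    </contact>
  </location>
  <location>
    <facility>
      <name>Site B</name>
      <address>
        <city>Toronto</city>
        <country>Canada</country>
      </address>
    </facility>
    <status>Recruiting</status>
  </location>
</clinical_study>'

# Parse the string and convert the whole document to a nested list.
xmlDoc <- xmlParse(xmlText, asText = TRUE)
xmlList <- xmlToList(xmlDoc)

# Keep only the top-level <location> sections.
locationSections <- xmlList[names(xmlList) == "location"]

# unlist() flattens each section into a named character vector,
# t() turns it into a one-row matrix, and as.data.table() makes it a
# one-row table. rbindlist(fill = TRUE) stacks the rows, matching
# columns by name and filling absent ones with NA.
locations <- rbindlist(
  lapply(locationSections, function(section) as.data.table(t(unlist(section)))),
  fill = TRUE
)

print(locations)
```

Because unlist joins nested names with a dot, the columns come out with names such as facility.name and facility.address.city, so each facility field gets its own column instead of one concatenated string.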