I have an XML file (a TEI-encoded play) that I want to turn into a data.frame in R in which each row holds one line of the play, its line number, the speaker of that line, the scene number, and the scene type. The body of the XML file looks like this (but longer):
<text>
  <body>
    <div1 type="scene" n="1">
      <sp who="fau">
        <l n="30">Settle thy studies, Faustus, and begin</l>
        <l n="31">To sound the depth of that thou wilt profess;</l>
        <l n="32">Having commenced, be a divine in show,</l>
      </sp>
      <sp who="eang">
        <l n="105">Go forward, Faustus, in that famous art,</l>
      </sp>
    </div1>
    <div1 type="scene" n="2">
      <sp who="sch1">
        <l n="NA">I wonder what's become of Faustus, that was wont to make our schools ring with sic probo.</l>
      </sp>
      <sp who="sch2">
        <l n="NA">That shall we know, for see here comes his boy.</l>
      </sp>
      <sp who="sch1">
        <l n="NA">How now sirrah, where's thy master?</l>
      </sp>
      <sp who="wag">
        <l n="NA">God in heaven knows.</l>
      </sp>
    </div1>
  </body>
</text>
The problem seems similar to questions posed here and here, but my XML file is structured slightly differently, so neither has given me a working solution. I've managed to do this:
library(XML)

doc <- xmlTreeParse("data/faustus_sample.xml", useInternalNodes = TRUE)

# Called once per <div1>; builds one row per child <sp> of that scene
bodyToDF <- function(x) {
  scenenum <- xmlGetAttr(x, "n")
  scenetype <- xmlGetAttr(x, "type")
  # Attributes of each <sp> child (here just 'who', i.e. the speaker)
  attributes <- sapply(xmlChildren(x, omitNodeTypes = "XMLInternalTextNode"), xmlAttrs)
  # Text of each <sp> child: all of its <l> lines run together
  linecontent <- sapply(xmlChildren(x), xmlValue)
  data.frame(scenenum = scenenum, scenetype = scenetype, attributes = attributes,
             linecontent = linecontent, stringsAsFactors = FALSE)
}

res <- xpathApply(doc, '//div1', bodyToDF)
temp.df <- do.call(rbind, res)
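This returns a data.frame with one row per speech (each <sp> element): the scene number, scene type, and speaker are intact, but all the lines of a speech end up run together in linecontent. I can't work out how to break it down to one row per <l> element and get the associated line number from its n attribute.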
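To make the target concrete, here is a rough sketch of the row-per-line shape I'm after. It starts from the <l> nodes and walks up to the enclosing <sp> and <div1> instead of starting from <div1>. It assumes every <l> sits somewhere inside an <sp> that sits inside a <div1>, which I have not checked against the full file. The helper name ancestor_attr and the inline sample string are made up for illustration.

```r
library(XML)

# Inline sample standing in for the real file (assumption: same structure)
sample_xml <- paste0(
  '<text><body>',
  '<div1 type="scene" n="1">',
  '<sp who="fau">',
  '<l n="30">Settle thy studies, Faustus, and begin</l>',
  '<l n="31">To sound the depth of that thou wilt profess;</l>',
  '</sp>',
  '<sp who="eang"><l n="105">Go forward, Faustus, in that famous art,</l></sp>',
  '</div1>',
  '<div1 type="scene" n="2">',
  '<sp who="sch2"><l n="NA">That shall we know, for see here comes his boy.</l></sp>',
  '</div1>',
  '</body></text>'
)
doc <- xmlParse(sample_xml, asText = TRUE)

# One node per line of the play
line_nodes <- getNodeSet(doc, "//l")

# Hypothetical helper: attribute of the nearest enclosing <element>,
# or NA if there is no such ancestor or the attribute is missing
ancestor_attr <- function(node, element, attribute) {
  hits <- getNodeSet(node, paste0("ancestor::", element, "[1]"))
  if (length(hits) == 0) return(NA_character_)
  xmlGetAttr(hits[[1]], attribute, default = NA_character_)
}

play <- data.frame(
  scenenum  = vapply(line_nodes, ancestor_attr, character(1),
                     element = "div1", attribute = "n"),
  scenetype = vapply(line_nodes, ancestor_attr, character(1),
                     element = "div1", attribute = "type"),
  speaker   = vapply(line_nodes, ancestor_attr, character(1),
                     element = "sp", attribute = "who"),
  # Kept as character because the file uses n="NA" for unnumbered lines
  linenum   = vapply(line_nodes, xmlGetAttr, character(1),
                     name = "n", default = NA_character_),
  line      = vapply(line_nodes, xmlValue, character(1)),
  stringsAsFactors = FALSE
)
```

With this sample, play should have one row per <l>. Whether the ancestor lookups hold up on the full file (for example, lines outside any <sp>, or nested divisions) is exactly what I'm unsure about.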
I also tried importing the file as a list (via xmlToList), but this gave me an incredibly messy list of lists of lists, and I ran into a variety of errors when I tried to access the different elements with for loops (terrible idea, I know!).
Ideally, I'm looking for a solution that will work on the full file in all its messiness and also work for other, similarly structured XML files.
I've just started using R and am totally at a loss. Any assistance you can provide will be hugely appreciated.
Thanks for your help!
EDIT: a copy of the full XML file is available here.