I have an XML structure that looks like the following, but on a much larger scale:
<root>
  <conference name='1'>
    <author>
      Bob
    </author>
    <author>
      Nigel
    </author>
  </conference>
  <conference name='2'>
    <author>
      Alice
    </author>
    <author>
      Mary
    </author>
  </conference>
</root>
For this, I used the following code:
dom = parse(filepath)
conference = dom.getElementsByTagName('conference')
for node in conference:
    conf_name = node.getAttribute('name')
    print conf_name
    alist = node.getElementsByTagName('author')
    for a in alist:
        authortext = a.nodeValue
        print authortext
However, the authortext that gets printed is None. I tried variations like the one below, but they cause my program to break.
authortext=a[0].nodeValue
The correct output should be:
1
Bob
Nigel
2
Alice
Mary
But what I get is:
1
None
None
2
None
None
Any suggestions on how to tackle this problem?
Element nodes don't have a nodeValue. You have to look at the Text nodes inside them. If you know there's always one text node inside, you can say element.firstChild.data (data is the same as nodeValue for text nodes).

Be careful: if there is no text content, there will be no child Text nodes and element.firstChild will be None (the DOM's null), causing the .data access to fail.

Quick way to get the content of direct child text nodes:
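A minimal sketch of one way to do that with minidom, assuming Python 3. The helper name direct_text and the sample markup are made up for illustration; only the childNodes, nodeType, TEXT_NODE and data attributes come from the DOM API.

```python
from xml.dom.minidom import parseString


def direct_text(element):
    """Join the data of the Text nodes that are direct children of element."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


# Hypothetical sample: the text inside the nested <b> element is not a
# direct child of <author>, so it is skipped.
doc = parseString("<author>Bob <b>the</b> Builder</author>")
author = doc.documentElement
print(repr(direct_text(author)))
```

Because it only looks at direct children, this returns an empty string for an element with no text instead of failing the way firstChild.data would.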
In DOM Level 3 Core you get the textContent property, which you can use to get text from inside an Element recursively, but minidom doesn't support this (some other Python DOM implementations do).

I played around with it a bit, and here's what I got to work:
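A sketch of what a working version along these lines could look like, assuming Python 3 and the sample XML from the question inlined as a string. The conference_authors function and the .strip() call are additions for illustration; the strip is there because the sample puts each name on its own line.

```python
from xml.dom.minidom import parseString

# Sample data from the question, inlined so the sketch is self-contained.
XML = """<root>
<conference name='1'>
<author>
Bob
</author>
<author>
Nigel
</author>
</conference>
<conference name='2'>
<author>
Alice
</author>
<author>
Mary
</author>
</conference>
</root>"""


def conference_authors(dom):
    """Return a list of (conference name, [author names]) pairs."""
    result = []
    for node in dom.getElementsByTagName('conference'):
        conf_name = node.getAttribute('name')
        authors = []
        for a in node.getElementsByTagName('author'):
            # The text lives in a child Text node, not on the element itself.
            authors.append(a.childNodes[0].nodeValue.strip())
        result.append((conf_name, authors))
    return result


for conf_name, authors in conference_authors(parseString(XML)):
    print(conf_name)
    for name in authors:
        print(name)
```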
leading to output of:
1
Bob
Nigel
2
Alice
Mary
I can't tell you exactly why you have to access the childNode to get the inner text, but at least that's what you were looking for.
The node you read authortext from is of type 1 (ELEMENT_NODE); normally you need a TEXT_NODE to get a string. This will work:

Since you always have one text data value per author, you can use element.firstChild.data.
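To make the node-type distinction concrete, here is a small sketch, assuming Python 3 and minidom. The element_text helper and the sample markup are hypothetical; the helper guards the firstChild.data access so that an empty element yields a default instead of raising an error.

```python
from xml.dom.minidom import parseString


def element_text(element, default=""):
    """Return element.firstChild.data, or default if there is no leading Text node."""
    first = element.firstChild
    if first is None or first.nodeType != first.TEXT_NODE:
        return default
    return first.data


# Hypothetical sample with one filled and one empty <author> element.
doc = parseString("<authors><author>Bob</author><author/></authors>")
authors = doc.getElementsByTagName("author")
for author in authors:
    # Elements report nodeType 1 (ELEMENT_NODE) and a nodeValue of None;
    # the string is only available from the child Text node.
    print(author.nodeType, author.nodeValue, repr(element_text(author)))
```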