I've looked through several posts but I haven't quite found any answers that have solved my problem.
Sample XML =
<TextWithNodes>
<Node id="0"/>TEXT1<Node id="19"/>TEXT2 <Node id="20"/>TEXT3<Node id="212"/>
</TextWithNodes>
So I understand that usually, if I had extracted TextWithNodes as a NodeList, I would do something like
nodeList = TextWithNodes[0].getElementsByTagName('Node')
for a in nodeList:
    node = a.nodeValue
    print(node)
All I get is None. I've read that you must write a.childNodes.nodeValue, but there isn't a child node to the node list, since it looks like all the Node ids are closing tags? If I use a.childNodes I get [].
When I get the node type for a, it is type 1, and TEXT_NODE = 3. I'm not sure if that is helpful.
I would like to extract TEXT1, TEXT2, etc.
You should use the ElementTree API instead of minidom for your task (as explained in the other answers here), but if you need to use minidom, here is a solution.
What you are looking for was added to DOM Level 3 as the textContent attribute. Minidom only supports Level 1. However, you can emulate textContent pretty closely with this function:
Which you can then use like so:
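A minimal sketch of such a function and its use, assuming the sample XML from the question. The name `get_text_content` is hypothetical, and the recursion only approximates DOM Level 3 `textContent` by concatenating text and CDATA nodes while skipping comments and processing instructions:

```python
from xml.dom import minidom


def get_text_content(node):
    """Approximate DOM Level 3 textContent for a minidom node (hypothetical helper)."""
    # Text and CDATA nodes carry their characters in .data.
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    # For every other node, concatenate the text of its children,
    # ignoring comments and processing instructions.
    skipped = (node.COMMENT_NODE, node.PROCESSING_INSTRUCTION_NODE)
    return "".join(
        get_text_content(child)
        for child in node.childNodes
        if child.nodeType not in skipped
    )


# Sample XML from the question, written without line breaks (an assumption
# made here so that no whitespace-only text nodes appear).
xml_source = (
    '<TextWithNodes>'
    '<Node id="0"/>TEXT1<Node id="19"/>TEXT2 <Node id="20"/>TEXT3<Node id="212"/>'
    '</TextWithNodes>'
)

doc = minidom.parseString(xml_source)
parent = doc.getElementsByTagName("TextWithNodes")[0]

# All text under the parent, joined into one string.
text = get_text_content(parent)
print(text)

# The individual pieces are the text-node children of the parent.
pieces = [
    child.data for child in parent.childNodes
    if child.nodeType == child.TEXT_NODE
]
print(pieces)
```

If the separate pieces are what you are after, the second loop is probably closer to what you want than the joined string.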
Notice how I got the text content of the parent node TextWithNodes. This is because your Node elements are siblings of those text nodes, not parents of them.
Using xml.etree.ElementTree (which is similar to lxml, which @DiegoNavrro used in his answer, except that etree is part of the standard library and doesn't have full XPath support etc.), you can give the following a go:
Note, this assumes that the XML <Node id="0"/>TEXT1 ... is correct. Because the text follows a closing tag, it becomes the tag's tail text. It is not the element's nodeValue, which is why the code in your question prints None.
If you wanted to parse some XML like <Node id="0">TEXT1</Node>, you would have to replace the line [element.tail for element in xml_etree] with [element.text for element in xml_etree].
A solution with lxml, right from the docs:
You can also extract the text of a specific node:
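A minimal sketch of both steps, assuming the sample XML from the question; it is not necessarily the example from the lxml docs. It uses only `itertext()`, `find()` and `.tail`, which lxml shares with the standard library's ElementTree, so the import falls back to the latter if lxml is not installed. The names `xml_source`, `all_text` and `node_19` are illustrative:

```python
try:
    from lxml import etree
except ImportError:
    # The calls used below exist in the standard library as well.
    import xml.etree.ElementTree as etree

# Sample XML from the question, written without line breaks (an assumption
# made here so that no whitespace-only text appears).
xml_source = (
    '<TextWithNodes>'
    '<Node id="0"/>TEXT1<Node id="19"/>TEXT2 <Node id="20"/>TEXT3<Node id="212"/>'
    '</TextWithNodes>'
)

root = etree.fromstring(xml_source)

# Every piece of text under the root, in document order.
all_text = list(root.itertext())
print(all_text)

# The text that follows one specific node is that node's tail.
node_19 = root.find(".//Node[@id='19']")
print(node_19.tail)
```

With lxml itself you could also use `root.xpath(...)` for richer queries; that call is not available in the standard library version.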
The issue here is that the text in the XML doesn't belong to any Node element; it sits between them, directly inside TextWithNodes.
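In ElementTree terms, that text is stored as the tail of the preceding Node element, which is what the `[element.tail for element in xml_etree]` line from the ElementTree answer above relies on. A minimal sketch, assuming the sample XML from the question (`xml_source` is an illustrative name, and `xml_etree` is taken here to be the parsed root element):

```python
import xml.etree.ElementTree as ET

# Sample XML from the question, written without line breaks (an assumption
# made here so that no whitespace-only tails appear).
xml_source = (
    '<TextWithNodes>'
    '<Node id="0"/>TEXT1<Node id="19"/>TEXT2 <Node id="20"/>TEXT3<Node id="212"/>'
    '</TextWithNodes>'
)

xml_etree = ET.fromstring(xml_source)

# Each Node is empty, so its own text is None.
texts = [element.text for element in xml_etree]
print(texts)

# The characters after each self-closing tag are that element's tail.
# The last Node is followed directly by the closing parent tag, so its tail is None.
tails = [element.tail for element in xml_etree]
print(tails)
```

Filtering out the `None` entries, for example with `[t for t in tails if t]`, leaves just the text pieces.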