Undefined entity error while using ElementTree

Posted 2019-07-17 05:42

Question:

I have a set of XML files that I need to read and format into a single CSV file. In order to read from the XML files, I have used the solution mentioned here.

My code looks like this:

from os import listdir
import xml.etree.cElementTree as et

files = listdir(".../blogs/")

for i in range(len(files)):
    # fname = ".../blogs/" + files[i]
    f = open(".../blogs/" + files[i], 'r')
    contents = f.read()
    tree=et.fromstring(contents)
    for el in tree.findall('post'):
        post = el.text

    f.close()

This raises cElementTree.ParseError: undefined entity at the line tree=et.fromstring(contents). Oddly enough, when I run the same commands one by one in the interactive Python interpreter (without the for-loop), they work perfectly.

In case you want to know the XML structure, it is like this:

<Blog>
<date> some date </date>
<post> some blog post </post>
</Blog>

So what is causing this error, and how come it doesn't run from the Python file, but runs from the command line?

Update: After reading this link I checked files[0] and found that the '&' symbol occurs in it a few times. I think that might be causing the problem. When I ran the same commands on the command line, I had read a randomly chosen file instead.

Answer 1:

As I mentioned in the update, there were some symbols that I suspected were causing the problem. The error didn't come up when I ran the same lines on the command line because I had picked a file at random, and it happened not to contain any such characters.
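
To illustrate why only some files fail, here is a minimal sketch. It assumes the offending files contain either a bare '&' or an HTML-style entity such as &nbsp;. The sample strings and the try_parse helper are made up for illustration and are not taken from the actual blog files. XML predefines only &amp;, &lt;, &gt;, &quot; and &apos;, so the parser rejects anything else:

```python
import xml.etree.ElementTree as et


def try_parse(xml_text):
    """Return None if xml_text parses, otherwise the parser's error message."""
    try:
        et.fromstring(xml_text)
    except et.ParseError as err:
        return str(err)
    return None


# Hypothetical sample documents, modelled on the <Blog> structure in the question.
samples = {
    "clean": "<Blog><date> some date </date><post> some blog post </post></Blog>",
    "html entity": "<Blog><post> see you&nbsp;soon </post></Blog>",
    "bare ampersand": "<Blog><post> Tom & Jerry </post></Blog>",
    "escaped ampersand": "<Blog><post> Tom &amp; Jerry </post></Blog>",
}

for name, doc in samples.items():
    # None means the document parsed; anything else is the error text.
    print(name, "->", try_parse(doc))
```

As far as I can tell, the entity case is the one reported as "undefined entity", while a bare '&' is usually reported as a not well-formed token. Running the same helper over each file in the directory would show which files trigger the error.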

Since I mainly required the content between the <post> and </post> tags, I created my own parser (as was suggested in the link mentioned in the update).

from os import listdir

files = listdir(".../blogs/")

for fname in files:
    # Read the whole file; the with-block closes it automatically.
    with open(".../blogs/" + fname, 'r') as f:
        contents = f.read()
    seek1 = contents.find('<post>')
    while seek1 != -1:
        seek2 = contents.find('</post>', seek1)
        if seek2 == -1:
            break  # unterminated <post>; stop scanning this file
        # Slice between the tags: '<post>' is 6 characters long.
        post = contents[seek1 + 6:seek2]
        seek1 = contents.find('<post>', seek2)
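
An alternative to slicing the file by hand is to escape the stray ampersands first and then keep using ElementTree. The following is only a sketch under that assumption: STRAY_AMP and extract_posts are hypothetical names, the sample string is invented, and the regex covers only XML's five predefined entities and numeric character references.

```python
import re
import xml.etree.ElementTree as et

# Matches an '&' that does NOT start one of XML's five predefined entities
# or a numeric character reference such as &#38; or &#x26;.
STRAY_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)")


def extract_posts(xml_text):
    """Escape stray ampersands, parse, and return the text of every <post>."""
    safe_text = STRAY_AMP.sub("&amp;", xml_text)
    root = et.fromstring(safe_text)
    # <post> elements are direct children of the <Blog> root.
    return [el.text for el in root.findall("post")]


# Hypothetical input containing both a bare '&' and an HTML entity.
sample = "<Blog><date> some date </date><post> Tom & Jerry&nbsp;rock </post></Blog>"
print(extract_posts(sample))
```

With this approach an entity like &nbsp; survives as literal text in the extracted post. If that is unwanted, passing each post through html.unescape from the standard library afterwards should convert it to the corresponding character.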