How should I deal with an XMLSyntaxError in Python

I'm trying to parse an XML file that's over 2GB with Python's lxml library. Unfortunately, the XML file does not have a line telling the character encoding, so I have to manually set it. While iterating through the file though, there are still some strange characters that come up once in a while.

I'm not sure how to determine the character encoding of the line, but furthermore, lxml will raise an XMLSyntaxError from the scope of the for loop. How can I properly catch this error, and deal with it correctly? Here's a simplistic code snippet:

from lxml import etree
etparse = etree.iterparse(file("my_file.xml", 'r'), events=("start",), encoding="CP1252")
for event, elem in etparse:
    if elem.tag == "product":
        print "Found the product!"
        elem.clear()

This eventually produces the error:

XMLSyntaxError: PCDATA invalid Char value 31, line 1565367, column 50

That line of the file looks like this:

% sed -n "1565367 p" my_file.xml
<romance_copy>Ravioli Florentine. Tender Ravioli Filled With Creamy Ricotta Cheese And

The 'F' of filled actually looks like this in my terminal:

xml line causing the error

标签： python xml encoding lxml

4条回答

Fickle 薄情

2楼-- · 2020-02-21 00:46

Found this thread from Google and while @Michael's answer ultimately lead me to a solution (to my problem at least) I wanted to provide a bit more of a copy/paste answer here for issues that can be solved so simply:

from lxml import etree

# Create a parser
parser = etree.XMLParser(recover=True)

parsed_file = etree.parse('/path/to/your/janky/xml/file.xml', parser=parser)

I was facing an issue where I had no control over the XML pre-processing and was being given a file with invalid characters. @Michael's answer goes on to elaborate on a way to approach invalid characters from which recover=True can't address. Fortunately for me, this was enough to keep things moving along.

0人赞添加讨论(0) 举报

\"骚年 ilove

3楼-- · 2020-02-21 00:49

The right thing to do here is make sure that the creator of the XML file makes sure that: A.) that the encoding of the file is declared B.) that the XML file is well formed (no invalid characters control characters, no invalid characters that are not falling into the encoding scheme, all elements are properly closed etc.) C.) use a DTD or an XML schema if you want to ensure that certain attributes/elements exist, have certain values or correspond to a certain format (note: this will take a performance hit)

So, now to your question. LXml supports a whole bunch of arguments when you use it to parse XML. Check out the documentation. You will want to look at these two arguments:

--> recover --> try hard to parse through broken XML
--> huge_tree --> disable security restrictions and support very deep trees and very long text content (only affects libxml2 2.7+)

They will help you to some degree, but certain invalid characters can just not be recovered from, so again, ensuring that the file is written correctly is your best bet to clean/well working code.

Ah yeah and one more thing. 2GB is huge. I assume you have a list of similar elements in this file (example list of books). Try to split the file up with a Regex Expression on the OS, then start multiple processes to part the pieces. That way you will be able to use more of your cores on your box and the processing time will go down. Of course you then have to deal with the complexity of merging the results back together. I can not make this trade off for you, but wanted to give it to you as "food for thought"

Addition to post: If you have no control over the input file and have bad characters in it, I would try to replace/remove these bad characters by iterating over the string before parsing it as a file. Here a code sample that removes Unicode control characters that you wont need:

#all unicode characters from 0x0000 - 0x0020 (33 total) are bad and will be replaced by "" (empty string)
for line in fileinput.input(xmlInputFileLocation, inplace=1):
    for pos in range(0,len(line)):
        if unichr(line[pos]) < 32:
            line[pos] = None
    print u''.join([c for c in line if c])

0人赞添加讨论(0) 举报

劳资没心，怎么记你

4楼-- · 2020-02-21 00:58

The codecs Python module supply an EncodedFile class that works as a wrapper to a file - you should pass an object of this class to lxml, set to replace unknown characters with XML char entities --

Try doing this:

from lxml import etree
import codecs

enc_file = codecs.EncodedFile(file("my_file.xml"), "ASCII", "ASCII", "xmlcharrefreplace")

etparse = etree.iterparse(enc_file, events=("start",), encoding="CP1252")
...

The "xmlcharrefreplace" constant passed is the "errors" parameter, and specifies what to do with unknown characters. It could be "strict" (raises an error), "ignore" (leave as is), "replace" (replaces char with "?"), "xmlrefreplace" (creates an "&#xxxx;" xml reference) or "backslahreplace" (creates a Python valid backslash reference). For more information, check: http://docs.python.org/library/codecs.html

0人赞添加讨论(0) 举报

beautiful°

5楼-- · 2020-02-21 01:03

I ran into this too, getting \x16 in data (the unicode 'synchronous idle' or 'SYN' character, displayed in the xml as ^V) which leads to an error when parsing the xml: XMLSyntaxError: PCDATA invalid Char value 22. The 22 is because because ord('\x16') is 22.

The answer from @michael put me on the right track. But some control characters below 32 are fine, like the return or the tab, and a few higher characters are still bad. So:

# Get list of bad characters that would lead to XMLSyntaxError.
# Calculated manually like this:
from lxml import etree
from StringIO import StringIO
BAD = []
for i in range(0, 10000):
    try:
        x = etree.parse(StringIO('<p>%s</p>' % unichr(i)))
    except etree.XMLSyntaxError:
        BAD.append(i)

This leads to a list of 31 characters that can be hardcoded instead of doing the above calculation in code:

BAD = [
    0, 1, 2, 3, 4, 5, 6, 7, 8,
    11, 12,
    14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
    # Two are perfectly valid characters but go wrong for different reasons.
    # 38 is '&' which gives: xmlParseEntityRef: no name.
    # 60 is '<' which gives: StartTag: invalid element namea different error.
]
BAD_BASESTRING_CHARS = [chr(b) for b in BAD]
BAD_UNICODE_CHARS = [unichr(b) for b in BAD]

Then use it like this:

def remove_bad_chars(value):
    # Remove bad control characters.
    if isinstance(value, unicode):
        for char in BAD_UNICODE_CHARS:
            value = value.replace(char, u'')
    elif isinstance(value, basestring):
        for char in BAD_BASESTRING_CHARS:
            value = value.replace(char, '')
    return value

If value is 2 Gigabyte you might need to do this in a more efficient way, but I am ignoring that here, although the question mentions it. In my case, I am the one creating the xml file, but I need to deal with these characters in the original data, so I will use this function before putting data in the xml.

0人赞添加讨论(0) 举报

How should I deal with an XMLSyntaxError in Python

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间