I need to escape special characters in an invalid XML file which is about 5000 lines long. Here's an example of the XML that I have to deal with:
<root>
<element>
<name>name & surname</name>
<mail>name@name.org</mail>
</element>
</root>
Here the problem is the character "&" in the name. How would you escape special characters like this with a Python library? I didn't find a way to do it with BeautifulSoup.
If you don't care about invalid characters in the XML, you could use the XML parser's recover option (see "Parsing broken XML with lxml.etree.iterparse").
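A minimal sketch of the recover approach, assuming lxml is installed. The variable names are illustrative. What happens to the stray "&" (dropped or kept as text) may depend on the underlying libxml2 version, so no exact output is claimed here:

```python
from lxml import etree

broken_xml = b"""<root>
<element>
<name>name & surname</name>
<mail>name@name.org</mail>
</element>
</root>"""

# recover=True asks the parser to keep going past well-formedness errors
parser = etree.XMLParser(recover=True)
root = etree.fromstring(broken_xml, parser=parser)

# Re-serializing the recovered tree produces well-formed XML
repaired = etree.tostring(root)
print(repaired.decode())
```

The rest of the document, such as the mail element, comes through unchanged. Only the offending text node is affected.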
You're probably just wanting to do some simple regexp-ery on the HTML before throwing it into BeautifulSoup.
Even simpler, if there aren't any SGML entities (&...;) in the code, html = html.replace('&', '&amp;') will do the trick. Otherwise, try this:
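A minimal sketch of such a substitution, using only the standard library. The names BARE_AMPERSAND and escape_bare_ampersands are made up for illustration, and writing the pattern as a negative lookahead is an assumption about its form. One side effect of that choice is that an ampersand at the very end of the input is matched too:

```python
import re

# "&" that is NOT followed by a letter, a digit, or "#"
BARE_AMPERSAND = re.compile(r"&(?![A-Za-z0-9#])")


def escape_bare_ampersands(text: str) -> str:
    """Escape ampersands that do not look like the start of an entity."""
    return BARE_AMPERSAND.sub("&amp;", text)


print(escape_bare_ampersands("<name>name & surname</name>"))
# <name>name &amp; surname</name>
```

An ampersand directly followed by a letter, as in "AT&T", is left alone by this pattern. A stricter pattern that insists on a complete &name; reference would be needed for such input.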
Essentially the regex looks for & not followed by alphanumeric or # characters. It won't deal with ampersands at the end of lines, but that's probably fixable.
This answer provides XML sanitizer functions, although they don't escape the unescaped characters but simply drop them instead.
Using bs4 with lxml
The question asked how to do it with Beautiful Soup. Here is a function which will sanitize a small XML bytes object with it. It was tested with the package requirements beautifulsoup4==4.8.0 and lxml==4.4.0. Note that lxml is required here by bs4.
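A minimal sketch of such a function, assuming beautifulsoup4 and lxml are both installed. The name sanitize_xml is illustrative, and the reliance on bs4 running lxml's XML parser leniently is an assumption worth checking against the installed versions:

```python
from bs4 import BeautifulSoup


def sanitize_xml(content: bytes) -> bytes:
    """Parse possibly broken XML leniently and serialize it again."""
    # "lxml-xml" selects lxml's XML parser (not its HTML parser)
    soup = BeautifulSoup(content, features="lxml-xml")
    # Serialization escapes whatever text survived parsing
    return soup.encode("utf-8")


broken_xml = b"""<root>
<element>
<name>name & surname</name>
<mail>name@name.org</mail>
</element>
</root>"""

print(sanitize_xml(broken_xml).decode("utf-8"))
```

Because the whole document is held in memory, this is only suitable for small inputs, such as the 5000-line file in the question.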
Using only lxml
Obviously there is not much point in using both bs4 and lxml when this can be done with lxml alone. This sanitizer function, which uses lxml==4.4.0, is essentially derived from the answer by jfs.
<name>name & surname</name>
is not well-formed XML. It should be:
<name>name &amp; surname</name>
All conformant XML tools should create this - you normally do not have to worry. If you create a string with the '&' character then an XML tool will output the escaped version. If you create the string by hand it is your responsibility to make sure it is escaped. If you use an XML editor it should escape it for you.
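To illustrate both cases with the standard library only (a sketch, with element names taken from the question): building the document through an XML API escapes the ampersand on output, while a hand-built string has to be escaped explicitly, for example with xml.sax.saxutils.escape.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Case 1: let an XML tool create the document; the raw "&" goes in as-is
root = ET.Element("root")
element = ET.SubElement(root, "element")
ET.SubElement(element, "name").text = "name & surname"
ET.SubElement(element, "mail").text = "name@name.org"

serialized = ET.tostring(root, encoding="unicode")
print(serialized)
# <root><element><name>name &amp; surname</name><mail>name@name.org</mail></element></root>

# Case 2: building the string by hand, so escaping is your responsibility
by_hand = "<name>" + escape("name & surname") + "</name>"
print(by_hand)
# <name>name &amp; surname</name>
```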
If the file has been given to you by someone else, send it back and tell them it is not well-formed. If they no longer exist, you will have to use a plain text editor. That's fragile and messy, but there is no other way. If the file also has ampersands elsewhere that are used for escaping, then the file is garbage.
See a 10-year-old post here and a later one here.