I'm working on code to parse a configuration file written in XML, where the XML tags are mixed case and the case is significant. Beautiful Soup appears to convert XML tags to lowercase by default, and I would like to change this behavior.
I'm not the first to ask a question on this subject [see here]. However, I did not understand the answer given to that question, and in BeautifulSoup-3.1.0.1, BeautifulSoup.py does not appear to contain any instances of "encodedName" or "Tag.__str__".
import html5lib
from html5lib import treebuilders
f = open("mydocument.html")
parser = html5lib.XMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))
document = parser.parse(f)
'document' is now a BeautifulSoup-like tree, but it retains the case of the tags. See html5lib for documentation and installation.
According to Leonard Richardson, creator and maintainer of Beautiful Soup, you can't.
It's much better to use lxml. It's much, much faster than BeautifulSoup. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Bicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
It's more suited for XML as well.
Background and Reason
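First, the background: HTML tag names are case-insensitive, so an HTML parser converts tags to lowercase.
Also, Beautiful Soup does not parse the input itself: internally it calls some parser to parse the input HTML/XML.
-> With the latest bs4 (= BeautifulSoup v4), if you don't specify a parser it picks the best HTML parser installed; the one built into Python is html.parser: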
soup = BeautifulSoup(yourXmlStr, 'html.parser')
But ALL of the HTML parsers are case-insensitive, so html.parser, and the other two listed in the official doc, will convert every TAG to a lowercase tag:
- lxml: BeautifulSoup(yourHtmlOrXmlStr, "lxml")
- html5lib: BeautifulSoup(yourHtmlOrXmlStr, "html5lib")
Example
<?xml version="1.0" encoding="UTF-8"?>
<XCUIElementTypeApplication type="XCUIElementTypeApplication" name="微信"
label="微信" enabled="true" visible="true" x="0" y="0" width="375" height="667">
<XCUIElementTypeWindow
type="XCUIElementTypeWindow" enabled="true" visible="true" x="0" y="0" width="375" height="667">
</XCUIElementTypeWindow>
</XCUIElementTypeApplication>
Output after parsing with an HTML parser (tags are lowercased):
<?xml version="1.0" encoding="UTF-8"?>
<xcuielementtypeapplication enabled="true" height="667" label="微信" name="微信" type="XCUIElementTypeApplication" visible="true" width="375" x="0" y="0">
<xcuielementtypewindow enabled="true" height="667" type="XCUIElementTypeWindow" visible="true" width="375" x="0" y="0">
</xcuielementtypewindow>
</xcuielementtypeapplication>
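The lowercasing can be reproduced without Beautiful Soup at all, using Python's standard-library html.parser module, which is what the 'html.parser' option is built on. This is only a minimal sketch: the class name TagCollector and the one-element sample string are made up for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags exactly as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        # 'tag' has already been lowercased by HTMLParser at this point
        self.tags.append(tag)
        self.attrs.append(dict(attrs))


collector = TagCollector()
collector.feed('<XCUIElementTypeWindow type="XCUIElementTypeWindow"></XCUIElementTypeWindow>')
collector.close()

print(collector.tags)   # tag name arrives in lowercase
print(collector.attrs)  # attribute value keeps its original case
```

So the tag name is lost before Beautiful Soup ever sees it, while attribute values such as type="XCUIElementTypeWindow" come through unchanged, which matches the output above.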
How to disable BeautifulSoup's automatic tag lowercasing?
- Solution: switch to an XML parser
- Reason: an XML parser treats tag names as case-sensitive
- -> so it does not convert tags to lowercase
- Code:
soup = BeautifulSoup(yourXmlStr, 'xml')
same as:
soup = BeautifulSoup(yourXmlStr, 'lxml-xml')
Output with the XML parser (tag case is preserved):
<?xml version="1.0" encoding="utf-8"?>
<XCUIElementTypeApplication enabled="true" height="667" label="微信" name="微信" type="XCUIElementTypeApplication" visible="true" width="375" x="0" y="0">
<XCUIElementTypeWindow enabled="true" height="667" type="XCUIElementTypeWindow" visible="true" width="375" x="0" y="0">
</XCUIElementTypeWindow>
</XCUIElementTypeApplication>
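A quick way to check this on your own machine, assuming both bs4 and lxml are installed (the 'xml' feature relies on lxml). The shortened sample string and the variable names here are just for illustration:

```python
from bs4 import BeautifulSoup  # third-party; the "xml" feature also needs lxml

# Shortened version of the example document (assumption: no XML declaration needed)
xml_str = (
    '<XCUIElementTypeApplication type="XCUIElementTypeApplication">'
    '<XCUIElementTypeWindow type="XCUIElementTypeWindow"/>'
    '</XCUIElementTypeApplication>'
)

soup = BeautifulSoup(xml_str, "xml")

# Lookups are case-sensitive once the tag names are preserved
window = soup.find("XCUIElementTypeWindow")
lower = soup.find("xcuielementtypewindow")

print(window.name)  # mixed-case name as written in the source
print(lower)        # nothing matches the all-lowercase name
```

Note that with the XML parser you also have to search with the exact mixed-case name, since the lowercase spelling no longer matches anything.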
More detail
Please refer to my post (in Chinese): "[Solved] Why Python's BeautifulSoup automatically converts XML tags to lowercase, and whether this can be disabled"
Just use a proper XML parser instead of a lib that's made to deal with broken files.
I suggest just taking a look at xml.etree or lxml.
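For the xml.etree route, something like the following should do for a config file with mixed-case tags. It is a rough sketch using only the standard library; the sample document is borrowed from the example earlier in the thread and the variable names are arbitrary:

```python
import xml.etree.ElementTree as ET

# Sample document with mixed-case tags (stand-in for the real config file)
xml_text = (
    '<XCUIElementTypeApplication type="XCUIElementTypeApplication">'
    '<XCUIElementTypeWindow type="XCUIElementTypeWindow"/>'
    '</XCUIElementTypeApplication>'
)

root = ET.fromstring(xml_text)

# Tag names come back exactly as written in the document
tag_names = [el.tag for el in root.iter()]
print(tag_names)

# find() is case-sensitive, so only the exact spelling matches
window = root.find("XCUIElementTypeWindow")
missing = root.find("xcuielementtypewindow")
print(window is not None, missing is None)
```

For a file on disk, ET.parse(path).getroot() gives the same kind of root element.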