Can I change BeautifulSoup's behavior regardin

Posted 2020-07-11 06:17

Question:

I'm working on code to parse a configuration file written in XML, where the XML tags are mixed case and the case is significant. Beautiful Soup appears to convert XML tags to lowercase by default, and I would like to change this behavior.

I'm not the first to ask a question on this subject [see here]. However, I did not understand the answer given to that question, and in BeautifulSoup-3.1.0.1, BeautifulSoup.py does not appear to contain any instances of "encodedName" or "Tag.__str__".

Answer 1:

import html5lib
from html5lib import treebuilders

f = open("mydocument.html")
parser = html5lib.XMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))
document = parser.parse(f)

'document' is now a BeautifulSoup-like tree, but it retains the case of the tags. See html5lib for documentation and installation.



Answer 2:

According to Leonard Richardson, creator and maintainer of Beautiful Soup, you can't.



Answer 3:

It's much better to use lxml. It's much, much faster than BeautifulSoup. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.

It's more suited for XML as well.



Answer 4:

Background and Reason

First, HTML tag names are case-insensitive, so HTML parsers normalize them to lowercase.

Second, Beautiful Soup does not parse the input itself; internally it calls a separate parser to process the HTML/XML.

-> With the latest bs4 (Beautiful Soup v4), a common choice is the built-in html.parser:

soup = BeautifulSoup(yourXmlStr, 'html.parser')

However, every HTML parser is case-insensitive, so html.parser

(and the other two HTML parsers listed in the official documentation:

  • lxml
    • BeautifulSoup(yourHtmlOrXmlStr, "lxml")
  • html5lib
    • BeautifulSoup(yourHtmlOrXmlStr, "html5lib")

) converts every tag name to lowercase.
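The lowercasing described above can be observed without Beautiful Soup at all, using only the standard-library html.parser module that bs4 wraps. The following is a minimal sketch; the TagCollector class and the collect_start_tags function are hypothetical names introduced here for illustration, not part of bs4 or of the original answer.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start-tag names exactly as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes the tag name already converted to lowercase.
        self.tags.append(tag)


def collect_start_tags(markup):
    """Feed markup to the stdlib HTML parser and return the start-tag names it saw."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


sample = '<XCUIElementTypeWindow type="XCUIElementTypeWindow"></XCUIElementTypeWindow>'
print(collect_start_tags(sample))  # ['xcuielementtypewindow']
```

The tag name arrives in lowercase before Beautiful Soup ever builds its tree, which suggests why no Beautiful Soup option can restore the original case once an HTML parser has been chosen.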

Example

  • input:
<?xml version="1.0" encoding="UTF-8"?>
<XCUIElementTypeApplication type="XCUIElementTypeApplication" name="微信"
    label="微信" enabled="true" visible="true" x="0" y="0" width="375" height="667">
    <XCUIElementTypeWindow
            type="XCUIElementTypeWindow" enabled="true" visible="true" x="0" y="0" width="375" height="667">
    </XCUIElementTypeWindow>
</XCUIElementTypeApplication>
  • output:
<?xml version="1.0" encoding="UTF-8"?>
    <xcuielementtypeapplication enabled="true" height="667" label="微信" name="微信" type="XCUIElementTypeApplication" visible="true" width="375" x="0" y="0">
    <xcuielementtypewindow enabled="true" height="667" type="XCUIElementTypeWindow" visible="true" width="375" x="0" y="0">
    </xcuielementtypewindow>
    </xcuielementtypeapplication>

How to disable Beautiful Soup's automatic lowercase conversion of tags

  • Solution: switch to the XML parser
  • Reason: the XML parser treats tag names as case-sensitive
    • -> it does not convert tags to lowercase
  • Code (both parser names select lxml's XML parser, so lxml must be installed)
soup = BeautifulSoup(yourXmlStr, 'xml')

which is the same as:

soup = BeautifulSoup(yourXmlStr, 'lxml-xml')
  • output example:
<?xml version="1.0" encoding="utf-8"?>
    <XCUIElementTypeApplication enabled="true" height="667" label="微信" name="微信" type="XCUIElementTypeApplication" visible="true" width="375" x="0" y="0">
    <XCUIElementTypeWindow enabled="true" height="667" type="XCUIElementTypeWindow" visible="true" width="375" x="0" y="0">
    </XCUIElementTypeWindow>
    </XCUIElementTypeApplication>

More detail

Please refer to my (Chinese) post: "[Solved] Why Python's BeautifulSoup automatically converts XML tags to lowercase, and whether this can be disabled"



Answer 5:

Just use a proper XML parser instead of a library that's made to deal with broken files.

I suggest taking a look at xml.etree or lxml.
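As a rough illustration of the xml.etree route, the sketch below parses a small mixed-case configuration document with the standard library and lists the tag names it finds. The CONFIG_XML sample and the tag_names helper are made up for this example and are not taken from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration snippet: two tags that differ only by case.
CONFIG_XML = """<Config>
    <ServerName>example</ServerName>
    <servername>other</servername>
</Config>"""


def tag_names(xml_text):
    """Return the tag names of all elements in document order, root first."""
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.iter()]


print(tag_names(CONFIG_XML))  # ['Config', 'ServerName', 'servername']

# Lookups are case-sensitive too: this matches <ServerName>, not <servername>.
root = ET.fromstring(CONFIG_XML)
print(root.find("ServerName").text)  # example
```

Because ElementTree is a strict XML parser, it raises a ParseError on malformed input rather than trying to repair it, which is usually acceptable for a configuration file you control.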