I can read all XML files that start with <?xml version="1.0" encoding="utf-8"?>
but I cannot read files that start with <?xml version="1.0" encoding="ISO-8859-1"?>.
Specifically, I have two files:
xml-iso.xml:
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
</note>
xml-utf.xml:
<?xml version="1.0" encoding="utf-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
</note>
With the following code I can find the note element in the UTF-8 file, but I cannot find it in the file with the other encoding. How can I solve that?
Example code:
import unittest

from bs4 import BeautifulSoup as Soup


class TestEncoding(unittest.TestCase):

    def test_iso(self):
        with open('tests/xml-iso.xml', 'r') as f_in:
            xml_soup = Soup(f_in.read(), 'xml')
        print('xml-iso:\n{}'.format(xml_soup))
        note = xml_soup.find('note')
        self.assertIsNotNone(note)

    def test_utf8(self):
        with open('tests/xml-utf.xml', 'r') as f_in:
            xml_soup = Soup(f_in.read(), 'xml')
        print('xml-utf8:\n{}'.format(xml_soup))
        note = xml_soup.find('note')
        self.assertIsNotNone(note)


if __name__ == '__main__':
    unittest.main()
Versions:
Python 3.5.2
beautifulsoup4==4.6.0
I have exactly the same problem. My workaround is to skip the XML declaration when reading the file.
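A minimal sketch of this workaround, assuming the declaration is the first thing in the file. The helper names `strip_xml_declaration` and `parse_without_declaration` are hypothetical, and the explicit `encoding='iso-8859-1'` default is an assumption taken from the declaration in xml-iso.xml, not something the original code passed:

```python
import re

# Leading XML declaration, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
XML_DECLARATION = re.compile(r'\A\s*<\?xml\s[^>]*\?>\s*')


def strip_xml_declaration(text):
    """Return text without its leading XML declaration, if there is one."""
    return XML_DECLARATION.sub('', text, count=1)


def parse_without_declaration(path, encoding='iso-8859-1'):
    """Parse an XML file after dropping its declaration (hypothetical helper)."""
    # Imported here so strip_xml_declaration works without bs4 installed.
    from bs4 import BeautifulSoup as Soup

    # Assumption: the caller knows the file's real encoding and passes it in.
    with open(path, 'r', encoding=encoding) as f_in:
        return Soup(strip_xml_declaration(f_in.read()), 'xml')
```

With the files from the question, `parse_without_declaration('tests/xml-iso.xml').find('note')` should then find the note element, since the parser no longer sees an encoding declaration that disagrees with the already decoded string.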
Coincidentally, I stumbled upon another workaround: read the file in binary mode ('rb').
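As a rough sketch of the binary-mode approach: the file is opened with 'rb' and the undecoded bytes are handed to Beautiful Soup. The helper names `read_xml_bytes` and `parse_from_bytes` are hypothetical and only serve the illustration:

```python
def read_xml_bytes(path):
    """Return the raw, undecoded contents of the file."""
    with open(path, 'rb') as f_in:
        return f_in.read()


def parse_from_bytes(path):
    """Parse an XML file from its raw bytes (hypothetical helper)."""
    # Imported here so read_xml_bytes works without bs4 installed.
    from bs4 import BeautifulSoup as Soup

    # Given bytes, the parser can apply the encoding named in the declaration.
    return Soup(read_xml_bytes(path), 'xml')
```

In the test from the question, this amounts to changing 'r' to 'rb' in the `open` call. It also avoids guessing the encoding in Python, which matters once the file contains non-ASCII characters.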