I'm reading a series of source code files using Python and running into a Unicode BOM error. Here's my code:
```python
import os
import chardet  # third-party: pip install chardet

bytes = min(32, os.path.getsize(filename))        # sample at most 32 bytes
raw = open(filename, 'rb').read(bytes)
result = chardet.detect(raw)                      # guess the encoding
encoding = result['encoding']
infile = open(filename, mode, encoding=encoding)  # mode is defined elsewhere, e.g. 'r'
data = infile.read()
infile.close()
print(data)
```
As you can see, I'm detecting the encoding using chardet, then reading the file into memory and attempting to print it. The print statement fails on Unicode files containing a BOM with the error:

```
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2:
character maps to <undefined>
```

I'm guessing it's trying to decode the BOM using the default character set and it's failing. How do I remove the BOM from the string to prevent this?
chardet detects BOM_UTF8 automatically since version 2.3.0, released on Oct 7, 2014:
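A minimal sketch of that flow (assuming chardet >= 2.3.0 is installed and the filename is an existing file; the path here is just a placeholder):

```python
import chardet  # third-party: pip install chardet

filename = 'example.txt'  # hypothetical path

# A 32-byte sample is plenty for the detector to see a BOM.
with open(filename, 'rb') as f:
    raw = f.read(32)
encoding = chardet.detect(raw)['encoding']

# Since chardet 2.3.0 a UTF-8 BOM is reported as 'UTF-8-SIG',
# so re-opening with the detected name strips the BOM on read.
with open(filename, encoding=encoding) as f:
    text = f.read()
print(text)
```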
Note: chardet may return 'UTF-XXLE' or 'UTF-XXBE' encodings that leave the BOM in the text; the 'LE' / 'BE' suffix should be stripped to avoid that -- though it is easier to detect the BOM yourself at this point, e.g., as in @ivan_pozdeev's answer.

To avoid a UnicodeEncodeError while printing Unicode text to the Windows console, see Python, Unicode, and the Windows console.
There is no reason to check whether a BOM exists or not; utf-8-sig manages that for you and behaves exactly as utf-8 if the BOM does not exist. As the example below shows, utf-8-sig correctly decodes the given bytes regardless of whether a BOM is present. If you think there is even a small chance that a BOM character might exist in the files you are reading, just use utf-8-sig and don't worry about it.
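A quick sketch comparing the two codecs (the sample bytes are illustrative):

```python
# UTF-8 without a BOM: both codecs behave identically.
print(repr(b'hello'.decode('utf-8')))        # 'hello'
print(repr(b'hello'.decode('utf-8-sig')))    # 'hello'

# UTF-8 with a BOM (EF BB BF): plain utf-8 keeps it as U+FEFF,
# utf-8-sig strips it.
print(repr(b'\xef\xbb\xbfhello'.decode('utf-8')))      # '\ufeffhello'
print(repr(b'\xef\xbb\xbfhello'.decode('utf-8-sig')))  # 'hello'
```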
A variant of @ivan_pozdeev's answer for strings/exceptions (rather than files): I'm dealing with Unicode HTML content that was stuffed into a Python exception (see http://bugs.python.org/issue2517).
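Something along these lines, assuming the content has already been decoded to a str somewhere upstream (the helper name is mine):

```python
def strip_bom(text):
    """Drop a leading BOM from an already-decoded string, if present."""
    # Decoded correctly, any UTF BOM shows up as the single character U+FEFF.
    if text.startswith('\ufeff'):
        return text[1:]
    # Decoded with a one-byte codec (latin-1/cp1252), the UTF-8 BOM
    # bytes EF BB BF survive as the three characters below.
    if text.startswith('\xef\xbb\xbf'):
        return text[3:]
    return text

print(repr(strip_bom('\ufeff<html>...</html>')))  # '<html>...</html>'
```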
Alternatively, this much simpler code is able to delete non-ASCII characters without much fuss:
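For example (a blunt sketch: it throws away every non-ASCII character, not just the BOM):

```python
text = '\ufeffnaïve café'  # illustrative string that starts with a BOM
clean = ''.join(ch for ch in text if ord(ch) < 128)
print(clean)  # 'nave caf' -- the BOM is gone, but so are the accents
```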
I've composed a nifty BOM-based detector based on Chewie's answer. It's sufficient in the common use case where data can be either in a known local encoding or Unicode with a BOM (which is what text editors typically produce). More importantly, unlike chardet, it doesn't do any random guessing, so it gives predictable results:
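A sketch of what such a detector might look like (the function name and the fallback parameter are mine):

```python
import codecs

def detect_by_bom(path, default):
    """Return an encoding name based on a leading BOM, or `default` if none."""
    with open(path, 'rb') as f:
        raw = f.read(4)  # reads fewer bytes if the file is shorter
    # Check UTF-32 before UTF-16: BOM_UTF32_LE starts with BOM_UTF16_LE.
    for enc, boms in (
        ('utf-8-sig', (codecs.BOM_UTF8,)),
        ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)),
        ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)),
    ):
        if any(raw.startswith(bom) for bom in boms):
            return enc
    return default

# e.g. fall back to the known local encoding when no BOM is found:
with open('data.txt', encoding=detect_by_bom('data.txt', 'cp1252')) as f:
    text = f.read()
```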
I find the other answers overly complex. There is a simpler way that doesn't need dropping down into the lower-level idiom of binary file I/O, doesn't rely on a character-set heuristic (chardet) that's not part of the Python standard library, and doesn't need a rarely-seen alternate encoding signature (utf-8-sig vs. the common utf-8) that doesn't seem to have an analog in the UTF-16 family.

The simplest approach I've found is to deal with BOM characters in Unicode and let the codecs do the heavy lifting. There is only one Unicode byte order mark, so once data is converted to Unicode characters, determining whether it's there and adding or removing it is easy. To read a file with a possible BOM:
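Roughly along these lines (the helper name and signature are mine):

```python
BOM = '\ufeff'  # the single Unicode byte order mark, U+FEFF

def read_text(path, encoding='utf-8'):
    """Read a text file and drop a leading BOM, if any."""
    with open(path, encoding=encoding) as f:
        text = f.read()
    return text[1:] if text.startswith(BOM) else text
```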
This works with all the interesting UTF codecs (e.g. utf-8, utf-16le, utf-16be, ...), doesn't require extra modules, and doesn't require dropping down into binary file processing or specific codec constants.

To write a BOM:
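And the writing side, under the same assumptions:

```python
def write_text(path, text, encoding='utf-8', add_bom=True):
    """Write text, optionally prefixed with the Unicode BOM."""
    if add_bom and not text.startswith(BOM):  # BOM as defined in the reading sketch
        text = BOM + text
    with open(path, 'w', encoding=encoding) as f:
        f.write(text)

write_text('out.txt', 'hello world', encoding='utf-16be')
```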
This works with any encoding. UTF-16 big endian is just an example.
This is not, by the way, to dismiss chardet. It can help when you have no information about what encoding a file uses. It's just not needed for adding / removing BOMs.

BOM characters should be automatically stripped when decoding UTF-16, but not UTF-8, unless you explicitly use the utf-8-sig encoding. You could try something like this:
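For instance, a sketch that checks for the UTF-8 BOM first and only falls back to chardet when there is none (reusing the filename variable from the question):

```python
import os
import codecs
import chardet

# filename as in the question
sample = open(filename, 'rb').read(min(32, os.path.getsize(filename)))

if sample.startswith(codecs.BOM_UTF8):
    encoding = 'utf-8-sig'   # decodes UTF-8 and strips the BOM
else:
    encoding = chardet.detect(sample)['encoding']

with open(filename, encoding=encoding) as infile:
    data = infile.read()
print(data)
```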