I've noticed recently that Python behaves in such non-obvious way when appending to the file using utf-8-sig
encoding. See below:
>>> import codecs, os
>>> os.path.isfile('123')
False
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
The following text ends up to the file:
<BOM>123
<BOM>123
Isn't that a bug? This is so not logical. Could anyone explain to me why it was done so? Why didn't they manage to prepend BOM only when file doesn't exist and needs to be created?
No, it's not a bug; that's perfectly normal, expected behavior. The codec cannot detect how much was already written to a file; you could use it to append to a pre-created but empty file for example. The file would not be new, but it would not contain a BOM either.
Then there are other use-cases where the codec is used on a stream or bytestring (e.g. not with
codecs.open()
) where there is no file at all to test, or where the developer wants to enforce a BOM at the start of the output, always.Only use
utf-8-sig
on a new file; the codec will always write the BOM out whenever you use it.If you are working directly with files, you can test for the start yourself; use
utf-8
instead and write the BOM manually, which is just an encoded U+FEFF ZERO WIDTH NO-BREAK SPACE:I used the newer
io.open()
instead ofcodecs.open()
;io
is the new I/O framework developed for Python 3, and is more robust thancodecs
for handling encoded files, in my experience.Note that the UTF-8 BOM is next to useless, really. UTF-8 has no variable byte order, so there is only one Byte Order Mark. UTF-16 or UTF-32, on the other hand, can be written with one of two distinct byte orders, which is why a BOM is needed.
The UTF-8 BOM is mostly used by Microsoft products to auto-detect the encoding of a file (e.g. not one of the legacy code pages).