I have a plain ASCII file. When I try to open it with codecs.open(..., "utf-8"), I am unable to read single characters. ASCII is a subset of UTF-8, so why can't codecs open such a file in UTF-8 mode?
# test.py
import codecs
f = codecs.open("test.py", "r", "utf-8")
# ASCII is supposed to be a subset of UTF-8:
# http://www.fileformat.info/info/unicode/utf8.htm
assert len(f.read(1)) == 1 # OK
f.readline()
c = f.read(1)
print len(c)
print "'%s'" % c
assert len(c) == 1 # fails
# max% p test.py
# 63
# '
# import codecs
#
# f = codecs.open("test.py", "r", "utf-8")
#
# # ASC'
# Traceback (most recent call last):
# File "test.py", line 15, in <module>
# assert len(c) == 1 # fails
# AssertionError
# max%
system:
Linux max 4.4.0-89-generic #112~14.04.1-Ubuntu SMP Tue Aug 1 22:08:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Of course it works with regular open. It also works if I remove the "utf-8" option. Also, what does 63 mean? That's like the middle of the third line. I don't get it.
Found your problem:
When passed an encoding, codecs.open returns a StreamReaderWriter, which is really just a wrapper around (not a subclass of; it's a "composed of" relationship, not inheritance) StreamReader and StreamWriter. The problem is:
- StreamReaderWriter provides a "normal" read method (that is, it takes a size parameter and that's it)
- It delegates to the internal StreamReader.read method, where the size argument is only a hint as to the number of bytes to read, not a limit; the second argument, chars, is a strict limiter, but StreamReaderWriter never passes that argument along (it doesn't accept it)
- When size is hinted but not capped using chars, and StreamReader has buffered data large enough to satisfy the size hint, StreamReader.read blindly returns the entire contents of the buffer rather than limiting it in any way based on the size hint (after all, only chars imposes a maximum return size); a minimal sketch of this behavior follows this list
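To make the size/chars distinction concrete, here is a minimal sketch (Python 2); "demo.txt" is just a hypothetical multi-line ASCII file used for illustration:
# demo of size (hint) vs. chars (cap) on a codecs-opened file
import codecs
f = codecs.open("demo.txt", "r", "utf-8")  # f is a StreamReaderWriter
f.readline()                    # leaves the rest of the decoded chunk buffered in f.reader
print len(f.read(1))            # size=1 is only a hint; the whole buffer can come back
f.seek(0)                       # rewind and repeat, this time capping with chars
f.readline()
print len(f.reader.read(6, 1))  # chars=1 strictly limits the result: prints 1
That buffered-remainder behavior is what produces the surprising 63 in the question's output.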
The API of StreamReader.read and the meaning of size/chars for that API are the only documented things here; the fact that codecs.open returns a StreamReaderWriter is not contractual, nor is the fact that StreamReaderWriter wraps StreamReader. I just used ipython's ?? magic to read the source code of the codecs module to verify this behavior. But documented or not, that's what it's doing (feel free to read the source code for StreamReaderWriter yourself; it's all Python level, so it's easy).
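If you don't have ipython handy, the standard inspect module shows the same thing; this is just one way to peek at the (non-contractual) implementation:
import codecs
import inspect
f = codecs.open("test.py", "r", "utf-8")
print type(f)                                            # codecs.StreamReaderWriter
print inspect.getsource(codecs.StreamReaderWriter.read)  # just delegates to self.reader.read(size)
print inspect.getsource(codecs.StreamReader.read)        # size is a hint, chars is the real cap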
The best solution is to switch to io.open, which is faster and more correct in every standard case. (codecs.open supports the weirdo codecs that don't convert between bytes [Py2 str] and str [Py2 unicode] but instead handle str-to-str or bytes-to-bytes encodings; that's an incredibly limited use case, though, since most of the time you're converting between bytes and str.) All you need to do is import io instead of codecs, and change the codecs.open line to:
f = io.open("test.py", encoding="utf-8")
The rest of your code can remain unchanged (and will likely run faster to boot).
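For reference, this is roughly what the question's script looks like with that single change applied (still Python 2):
# test.py, using io instead of codecs
import io
f = io.open("test.py", "r", encoding="utf-8")
assert len(f.read(1)) == 1  # OK, as before
f.readline()
c = f.read(1)
print len(c)                # 1; io treats the argument as a hard character limit, not a hint
assert len(c) == 1          # now passes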
As an alternative, you could explicitly bypass StreamReaderWriter to get at the StreamReader's read method and pass the limiting argument directly, e.g. change:
c = f.read(1)
to:
# Pass second, character limiting argument after size hint
c = f.reader.read(6, 1) # 6 is sort of arbitrary; should ensure a full char read in one go
I suspect Python Bug #8260, which covers intermingling readline and read on codecs.open-created file objects, applies here. Officially it's "fixed", but if you read the comments, the fix wasn't complete (and may not be possible to complete given the documented API); arbitrarily weird combinations of read and readline will still be able to break it.
Again, just use io.open; as long as you're on Python 2.6 or higher, it's available, and it's just plain better.