I'm using Python 2.6 to read a latin2-encoded file with Windows line endings ('\r\n').
import codecs
file = codecs.open('stackoverflow_secrets.txt', encoding='latin2', mode='rt')
line = file.readline()
print(repr(line))
outputs : u'login: yabcok\n'
file = codecs.open('stackoverflow_secrets.txt', encoding='latin2', mode='r')
line = file.readline()
print(repr(line))
or
file = codecs.open('stackoverflow_secrets.txt', encoding='latin2', mode='rb')
line = file.readline()
print(repr(line))
outputs : u'password: l1x1%Dm\r\n'
My questions:
- Why is text mode not the default? The documentation states otherwise. Is the codecs module commonly used with binary files?
- Why aren't newline characters stripped from readline() output? This is annoying and redundant.
- Is there a way to specify the newline character for files that are not ASCII encoded?
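Are you sure that your examples are correct? The documentation of the codecs module says that files are always opened in binary mode, even if no binary mode was specified, so no automatic conversion of '\n' is done on reading and writing.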
On my system, with a Latin-2 encoded file + DOS line endings, there's no difference between "rt", "r" and "rb" (Disclaimer: I'm using 2.5 on Linux).
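To check this on your own machine, here is a minimal sketch. It writes a small Latin-2 file with DOS line endings to a temporary path (the sample text and the temporary file are assumptions for illustration) and reads the first line back through codecs.open in 'r' and 'rb' modes. 'rt' is left out on purpose, because its handling is not consistent across Python versions.

```python
import codecs
import os
import tempfile

# Hypothetical sample data: Latin-2 text with Windows line endings.
raw = u'login: yabcok\r\npassword: l1x1%Dm\r\n'.encode('latin2')

# Write the raw bytes so that no newline translation happens on the way in.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'wb') as f:
    f.write(raw)

# Read the first line once per mode and remember what came back.
first_lines = {}
for mode in ('r', 'rb'):
    reader = codecs.open(path, encoding='latin2', mode=mode)
    try:
        first_lines[mode] = reader.readline()
    finally:
        reader.close()
os.remove(path)

print(repr(first_lines['r']))
print(repr(first_lines['rb']))
```

If codecs.open really forces binary mode, both reads should return the line with its '\r\n' intact.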
The documentation for open also mentions no "t" flag, so that behavior seems a little strange.
Newline characters are not stripped from lines because not all lines returned by readline may end in newlines. If the file does not end with a newline, the last line does not carry one. (I obviously can't come up with a better explanation).
Newline characters do not differ based on the encoding (at least not among the ones which use ASCII for 0-127), only based on the platform. You can specify "U" in the mode when opening the file and Python will detect any form of newline, either Windows, Mac or Unix.
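As an alternative to the "U" flag, the io module (available from Python 2.6 on) lets you combine an encoding with explicit newline handling. Note that this swaps in io.open for codecs.open; it is a sketch under that assumption, and the sample file contents are made up for the example.

```python
import io
import os
import tempfile

# Hypothetical sample data: Latin-2 text with Windows line endings.
raw = u'login: yabcok\r\npassword: l1x1%Dm\r\n'.encode('latin2')

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'wb') as f:
    f.write(raw)

# newline=None (the default) enables universal newlines:
# '\r\n', '\r' and '\n' are all translated to '\n' after decoding.
with io.open(path, 'r', encoding='latin2') as f:
    translated = f.readlines()

# newline='' still splits on every newline style, but leaves the
# original line endings untouched.
with io.open(path, 'r', encoding='latin2', newline='') as f:
    untouched = f.readlines()

os.remove(path)

print(translated)
print(untouched)
```

Because the translation happens on decoded text rather than on raw bytes, this approach does not depend on the encoding being ASCII-compatible.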
'rt' isn't a real mode as such - that will do the same as 'r'.
See Torsten's answer.
Also, if you are using anything but Windows, text mode files behave identically to binary files anyway.
You may instead be thinking of 'U'niversal newlines mode, which attempts to allow other platforms' text-mode files to work. Whilst it is possible to pass a 'U' flag to codecs.open, given the doc as outlined above I think it's a bug. Certainly the results would go wrong on UTF-16 and some East Asian codecs, so don't rely on it.
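To see why byte-level newline translation is risky with UTF-16, consider this small sketch. It simulates the translation with bytes.replace, which is an assumption about how a text-mode layer beneath the decoder would behave, not the actual implementation. The character U+0D0A happens to encode to the bytes 0x0D 0x0A in UTF-16 big-endian, so it looks exactly like a Windows line ending.

```python
# U+0D0A (MALAYALAM LETTER UU) encodes to the two bytes 0x0D 0x0A.
text = u'\u0d0a'
encoded = text.encode('utf-16-be')

# Simulated byte-level newline translation (assumption for illustration):
# '\r\n' is collapsed to '\n' before the decoder ever sees the data.
mangled = encoded.replace(b'\r\n', b'\n')

try:
    mangled.decode('utf-16-be')
    decoded_ok = True
except UnicodeDecodeError:
    # A single leftover byte is not a complete UTF-16 code unit.
    decoded_ok = False

print(repr(encoded), repr(mangled), decoded_ok)
```

The character is not a newline at all, yet the translation destroys it and the data no longer decodes.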
It is necessary to be able to tell whether the last line of the file ends with a trailing newline.
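A short sketch of what the retained newline buys you, using io.StringIO in place of a real file (the helper name read_lines and the sample strings are hypothetical). An empty string from readline means end of file, a bare '\n' means a blank line, and the last element shows whether the data ended with a newline. If you don't need that information, rstrip('\r\n') removes the line ending without touching other whitespace.

```python
import io


def read_lines(text):
    """Collect lines with readline() until it signals end of file."""
    stream = io.StringIO(text)
    lines = []
    while True:
        line = stream.readline()
        if not line:  # '' is returned only at end of file
            break
        lines.append(line)
    return lines


with_newline = read_lines(u'a\n\nb\n')
without_newline = read_lines(u'a\n\nb')

# The blank line survives as '\n', so it is distinguishable from EOF.
print(with_newline)
print(without_newline)

# Whether the input ended with a newline is visible on the last line.
print(with_newline[-1].endswith(u'\n'), without_newline[-1].endswith(u'\n'))

# Strip only the line ending when it is not wanted.
stripped = [line.rstrip(u'\r\n') for line in without_newline]
print(stripped)
```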