Newline characters in non-ASCII encoded files

I'm using Python 2.6 to read a latin2-encoded file with Windows line endings ('\r\n').

import codecs

file = codecs.open('stackoverflow_secrets.txt', encoding='latin2', mode='rt')
line = file.readline()
print(repr(line))

outputs: u'login: yabcok\n'

file = codecs.open('stackoverflow_secrets.txt', encoding='latin2', mode='r')
line = file.readline()
print(repr(line))

file = codecs.open('stackoverflow_secrets.txt', encoding='latin2', mode='rb')
line = file.readline()
print(repr(line))

outputs: u'password: l1x1%Dm\r\n'

My questions:

Why is text mode not the default? The documentation states otherwise. Is the codecs module commonly used with binary files?
Why aren't newline characters stripped from readline() output? This is annoying and redundant.
Is there a way to specify the newline character for files that are not ASCII-encoded?

Tags: python encoding file

2 answers

虎瘦雄心在

Reply #2 · 2019-04-17 03:37

Are you sure that your examples are correct? The documentation of the codecs module says:

Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.

On my system, with a Latin-2 encoded file and DOS line endings, there's no difference between "rt", "r" and "rb" (disclaimer: I'm using Python 2.5 on Linux).

The documentation for open also mentions no "t" flag, so that behavior seems a little strange.
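
A small check of the point about binary mode, as a sketch only: it writes a throwaway Latin-2 file with '\r\n' line endings to a temporary directory and reads the first line back through codecs.open with modes 'r' and 'rb'. The file name, the sample text and the helper first_line are made up for illustration, and the code is written for Python 3 rather than the 2.5/2.6 used in this thread. 'rt' is left out because it is not a documented mode.

```python
import codecs
import os
import tempfile

# Hypothetical sample data: two lines with Windows line endings.
SAMPLE = u"login: yabcok\r\npassword: secret\r\n"


def first_line(path, mode):
    """Return the first line of a Latin-2 file opened via codecs.open."""
    f = codecs.open(path, mode=mode, encoding="latin2")
    try:
        return f.readline()
    finally:
        f.close()


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample_latin2.txt")

# Write raw bytes so that no newline translation can happen on the way in.
with open(path, "wb") as out:
    out.write(SAMPLE.encode("latin2"))

line_r = first_line(path, "r")
line_rb = first_line(path, "rb")

print(repr(line_r))
print(repr(line_rb))
```

Both calls should return the line with '\r\n' still attached, which matches the documentation note quoted above.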

Newline characters are not stripped from lines because not all lines returned by readline may end in newlines. If the file does not end with a newline, the last line does not carry one. (I obviously can't come up with a better explanation).
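
To illustrate why readline() keeps the newline, here is a minimal sketch using an in-memory stream (io.StringIO stands in for a real file; the sample text and the helper name read_all_lines are invented for the example). The last line has no trailing newline, and an empty string signals end of file, so a stripped empty line would be indistinguishable from EOF.

```python
import io


def read_all_lines(stream):
    """Collect lines with readline() until it returns '' (end of file)."""
    lines = []
    while True:
        line = stream.readline()
        if line == u"":  # only EOF yields an empty string
            break
        lines.append(line)
    return lines


# Hypothetical content: an empty second line, no newline after the last line.
stream = io.StringIO(u"first\n\nlast")
raw_lines = read_all_lines(stream)

# Strip the line ending yourself when you do not need it.
stripped = [line.rstrip(u"\r\n") for line in raw_lines]

print(raw_lines)
print(stripped)
```

If readline() stripped newlines itself, the empty second line would come back as '' and the loop would stop there.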

Newline characters do not differ based on the encoding (at least not among the ones which use ASCII for 0-127), only based on the platform. You can specify "U" in the mode when opening the file and Python will detect any form of newline, either Windows, Mac or Unix.
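
As a possible alternative to the "U" flag, and an assumption on my part since io.open is not mentioned in this thread: the io module (available since Python 2.6) decodes the bytes first and translates newlines on the decoded text, and its newline argument lets you choose the behaviour. A sketch with a throwaway file and made-up sample text:

```python
import io
import os
import tempfile

# Hypothetical sample: Latin-2 text with Windows line endings.
SAMPLE = u"login: yabcok\r\npassword: secret\r\n"

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample_latin2.txt")
with open(path, "wb") as out:
    out.write(SAMPLE.encode("latin2"))

# newline=None (the default): '\r\n', '\r' and '\n' are all returned as '\n'.
with io.open(path, "r", encoding="latin2", newline=None) as f:
    translated = f.readline()

# newline="": line endings are recognised but passed through untranslated.
with io.open(path, "r", encoding="latin2", newline="") as f:
    untranslated = f.readline()

print(repr(translated))
print(repr(untranslated))
```

Because the translation happens after decoding, this approach does not depend on the encoding being ASCII-compatible.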


做自己的国王

Reply #3 · 2019-04-17 03:39

mode='rt'

'rt' isn't a real mode as such - that will do the same as 'r'.

Why is text mode not the default?

See Torsten's answer.

Also, if you are using anything but Windows, text mode files behave identically to binary files anyway.

You may instead be thinking of 'U'niversal newlines mode, which attempts to allow other platforms' text-mode files to work. Whilst it is possible to pass a 'U' flag to codecs.open, given the documentation quoted in the other answer I think that's a bug. Certainly the results would go wrong on UTF-16 and some East Asian codecs, so don't rely on it.
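
Why byte-level newline handling breaks on UTF-16 can be sketched as follows. The function below is only a stand-in for what a file layer working on raw bytes might do (it is not how any particular Python version implements the 'U' flag), and the sample strings are invented. In UTF-16-LE every character takes at least two bytes, so '\r\n' is not the byte pair 0x0D 0x0A, and the byte 0x0D also appears inside unrelated characters such as 'č' (U+010D).

```python
def naive_universal_newlines(raw):
    """Translate '\\r\\n' and '\\r' to '\\n' on raw bytes, ignoring the encoding."""
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def roundtrip(text, encoding):
    """Encode text, translate newlines on the bytes, then decode again."""
    raw = text.encode(encoding)
    return naive_universal_newlines(raw).decode(encoding)


# Latin-2 is a single-byte, ASCII-compatible encoding: the result is as intended.
latin2_result = roundtrip(u"a\r\nb", "latin2")

# UTF-16-LE: '\r\n' is 0D 00 0A 00, so the '\r' is rewritten on its own.
utf16_newline_result = roundtrip(u"a\r\nb", "utf-16-le")

# UTF-16-LE: U+010D is encoded as 0D 01, so its low byte gets rewritten.
utf16_letter_result = roundtrip(u"\u010d", "utf-16-le")

print(repr(latin2_result))
print(repr(utf16_newline_result))
print(repr(utf16_letter_result))
```

With Latin-2 the result is the intended 'a\nb'. With UTF-16-LE the line break is doubled, and 'č' silently turns into a different character (U+010A).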

Why aren't newline characters stripped from readline() output?

It is necessary to be able to tell whether the last line of the file ends with a trailing newline.
