Python, read CRLF text file as is, with CRLF

Posted 2019-02-16 14:57

with open(fn, 'rt') as f:
    lines = f.readlines()

This reads a CRLF text file (WinXP, Python 2.6) but returns the lines with LF endings, so every line ends in '\n'. How can I get the lines exactly as they appear in the file?

for a CRLF file, get lines ending in '\r\n'
for an LF file, get lines ending in '\n'

Answer 1:

Instead of the built-in open() function, use io.open(). Its newline argument gives you more control over how newlines are handled:

import io

with io.open(fn, 'rt', newline='') as f:
    lines = f.readlines()

Setting newline to the empty string leaves universal newline support enabled but returns line endings untranslated. You can still use .readlines() to split on any of the legal line terminators, but the data returned is exactly what is in the file:

On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.

Emphasis mine (on the last sentence).
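
To check which terminators a file actually uses, the untranslated lines can be inspected directly. The following is only a minimal sketch: the helper names line_ending() and count_line_endings() and the sample file contents are made up for illustration, and the sample is plain ASCII so the default encoding does not matter.

```python
import io
import os
import tempfile


def line_ending(line):
    """Return the terminator of a line read with newline='' ('' if none)."""
    if line.endswith('\r\n'):
        return '\r\n'
    if line.endswith('\n') or line.endswith('\r'):
        return line[-1]
    return ''


def count_line_endings(path):
    """Count each kind of terminator, reading the file untranslated."""
    counts = {}
    with io.open(path, 'r', newline='') as f:
        for line in f:
            ending = line_ending(line)
            counts[ending] = counts.get(ending, 0) + 1
    return counts


# Hypothetical sample: two CRLF lines, one LF line, and an unterminated last line.
fd, sample_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'One\r\nTwo\nThree\r\nFour')

counts = count_line_endings(sample_path)
os.remove(sample_path)
print(counts)
```

The '' key counts lines with no terminator at all, which can only be the last line of the file.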

This is different from opening the file in binary mode, where .readlines() will only split the file on \n characters. For a file with \r line endings or mixed line endings, this means that lines are not going to be split correctly.

Demo:

>>> import io
>>> open('test.txt', 'wb').write('One\nTwo\rThree\r\n')
>>> open('test.txt', 'rb').readlines()
['One\n', 'Two\rThree\r\n']
>>> io.open('test.txt', 'r', newline='').readlines()
[u'One\n', u'Two\r', u'Three\r\n']

Note that io.open() also decodes file contents to unicode values, which is why the lines in the last result carry a u prefix.
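
Because of that decoding step, it is usually safer to pass the encoding explicitly instead of relying on the platform default. A small sketch, assuming the file is UTF-8 (the sample text and the choice of encoding are assumptions, not something the question states):

```python
import io
import os
import tempfile

# Write a hypothetical UTF-8 sample with one CRLF line and one LF line.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(u'caf\xe9\r\nna\xefve\n'.encode('utf-8'))

# encoding controls the byte-to-text decoding; newline='' keeps the terminators.
with io.open(path, 'r', encoding='utf-8', newline='') as f:
    lines = f.readlines()

os.remove(path)
print(lines)
```

The two arguments are independent: encoding only affects how bytes become text, while newline='' only affects whether line endings are translated.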