What makes parsing a text file in 'r' mode more convenient than parsing it in 'rb' mode, especially when the text file in question may contain non-ASCII characters?
This depends a little bit on what version of Python you're using. In Python 2, Chris Drappier's answer applies.
In Python 3, it's a different (and more consistent) story: in text mode ('r'), Python will parse the file according to the text encoding you give it (or, if you don't give one, a platform-dependent default), and read() will give you a str. In binary ('rb') mode, Python does not assume that the file contains things that can reasonably be parsed as characters, and read() gives you a bytes object.
Also, in Python 3, universal newlines (the translation between '\n' and platform-specific newline conventions, so you don't have to care about them) is available for text-mode files on any platform, not just Windows.
For clarification and to answer Agostino's comment/question (I don't have sufficient reputation to comment so bear with me stating this as an answer...):
In Python 2 on non-Windows platforms, no line-end modification happens, in either text or binary mode. As has been stated before, in Python 2 Chris Drappier's answer applies (please note that its link nowadays points to the 3.x Python docs, but Chris' quoted text is of course from the Python 2 input and output tutorial).
So no, it is not true that opening a file in text mode with Python 2 on non-Windows does any line-end modification.
It is, however, possible to open the file in universal newline mode in Python 2 (mode 'rU'), which performs exactly that line-end normalization.
(The universal newline mode specifier is deprecated as of Python 3.x.)
On Python 3, on the other hand, platform-specific line ends do get normalized to '\n' when reading a file in text mode, and '\n' gets converted to the current platform's default line end when writing in text mode (in addition to the bytes<->unicode<->bytes decoding/encoding going on in text mode). E.g. reading a Dos/Win CRLF-line-ended file on Linux will normalize the line ends to '\n'.
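To make the Python 3 behaviour described above concrete, here is a minimal sketch. It assumes Python 3 and a UTF-8 encoded file; the temporary file and the sample content are made up purely for illustration. The same bytes are read once in text mode and once in binary mode.

```python
import os
import tempfile

# Sample bytes (illustrative only): UTF-8 encoded "café" followed by
# Windows (\r\n), old-Mac (\r) and Unix (\n) line ends.
raw = "café".encode("utf-8") + b"\r\nsecond\rthird\n"

# Write the bytes untouched to a temporary file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(raw)

# Text mode: bytes are decoded with the given encoding and, with the
# default newline handling, all three line-end styles become "\n".
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: no decoding and no newline translation.
with open(path, "rb") as f:
    data = f.read()

os.remove(path)

print(type(text).__name__, repr(text))
print(type(data).__name__, repr(data))
```

Passing encoding explicitly, rather than relying on the platform-dependent default, is what keeps the text-mode result the same across machines when the file contains non-ASCII characters.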
The difference lies in how the end-of-line (EOL) is handled. Different operating systems use different characters to mark EOL: \n in Unix, \r in Mac versions prior to OS X, \r\n in Windows. When a file is opened in text mode and then read, Python replaces the OS-specific end-of-line character read from the file with just \n. And vice versa, i.e. when you try to write \n to a file opened in text mode, it is going to write the OS-specific EOL character. You can find your OS's default EOL by checking os.linesep.
When a file is opened in binary mode, no mapping takes place. What you read is what you get. Remember, text mode is the default mode. So if you are handling non-text files (images, video, etc.), make sure you open the file in binary mode, otherwise you'll end up messing up the file by introducing (or removing) some bytes.
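The write-side mapping and the "what you read is what you get" property of binary mode can be checked with a short sketch. It assumes Python 3 with the default newline handling of open(); the temporary file and the byte payload are hypothetical stand-ins for a real non-text file.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

# Text mode: each "\n" written is replaced by os.linesep on disk.
with open(path, "w") as f:
    f.write("a\nb\n")

# Read the result back as raw bytes to see what was actually stored.
with open(path, "rb") as f:
    on_disk = f.read()

# Binary mode: bytes that merely look like line ends are left alone.
payload = bytes([0x89, 0x0D, 0x0A, 0x1A, 0x0A, 0x0D])
with open(path, "wb") as f:
    f.write(payload)
with open(path, "rb") as f:
    round_trip = f.read()

os.remove(path)

print(repr(os.linesep), repr(on_disk))
print(round_trip == payload)
```

On a platform where os.linesep is "\n" the text-mode write looks unchanged, which is why this kind of corruption of non-text files often only shows up on Windows.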
Python also has a universal newline mode. When a file is opened in this mode, Python maps all of the characters \r, \n and \r\n to \n.
From the documentation: