I have a file in UTF-8, where some lines contain the U+2028 Line Separator character (http://www.fileformat.info/info/unicode/char/2028/index.htm). I don't want it to be treated as a line break when I read lines from the file. Is there a way to exclude it from separators when I iterate over the file or use readlines()? (Besides reading the entire file into a string and then splitting by \n.) Thank you!
Thanks to everyone for answering. I think I know why you might not have been able to replicate this. I just realized that it happens if I decode the file when opening it, as with codecs.open.
The lines are not split on U+2028 if I open the file first and then decode the individual lines.
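For anyone trying to reproduce the difference, here is a minimal sketch in Python 3 syntax (the thread is about Python 2.6, so take it as an illustration, not the exact code from the post). It assumes codecs.getreader("utf-8"), which builds the same kind of stream reader that codecs.open uses; the sample text and the function names are made up for the example.

```python
import codecs
import os
import tempfile

# Hypothetical sample: one U+2028 inside the first "\n"-terminated line.
SAMPLE = "first\u2028still first\nsecond\n"


def lines_decoded_while_reading(path):
    """Decode through a codecs stream reader, then iterate over lines."""
    with open(path, "rb") as raw:
        # The codecs reader finds line ends with str.splitlines(),
        # which treats U+2028 as a line boundary.
        return list(codecs.getreader("utf-8")(raw))


def lines_decoded_one_by_one(path):
    """Iterate over raw byte lines first, then decode each line."""
    with open(path, "rb") as raw:
        # A binary file only breaks lines at the byte b"\n".
        return [line.decode("utf-8") for line in raw]


def demo():
    """Write SAMPLE to a temporary file and read it back both ways."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(SAMPLE.encode("utf-8"))
        return lines_decoded_while_reading(path), lines_decoded_one_by_one(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    streamed, per_line = demo()
    print(len(streamed), "lines when decoding while reading")
    print(len(per_line), "lines when decoding line by line")
```

The first function should return three pieces (the first one ending in U+2028), the second only the two lines that end in \n.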
(I'm using Python 2.6 on Windows. The file was originally UTF-16LE and was then converted to UTF-8.)
This is very interesting, I guess I won't be using codecs.open much from now on :-).
I can't duplicate this behaviour in Python 2.5, 2.6 or 3.0 on Mac OS X - U+2028 is always treated as a non-line-ending character. Could you go into more detail about where you see this error?
That said, here is a subclass of the "file" class that might do what you want:
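The subclass itself is not reproduced here. As a rough stand-in, assuming Python 3 (where the built-in file type no longer exists, so this wraps a binary stream instead of subclassing), a reader that only breaks on \n might look like the sketch below. The class name NewlineOnlyReader is hypothetical, and it assumes an ASCII-compatible encoding such as UTF-8.

```python
import io


class NewlineOnlyReader:
    """Hypothetical wrapper: decoded lines that end at a line feed only."""

    def __init__(self, raw, encoding="utf-8"):
        # raw must be a binary stream (e.g. open(path, "rb")).
        # Assumption: the encoding never uses byte 0x0A inside a
        # multi-byte character. That holds for UTF-8, not for UTF-16.
        self._raw = raw
        self._encoding = encoding

    def readline(self):
        # Binary readline() stops at b"\n" and nowhere else, so
        # U+2028 (bytes E2 80 A8 in UTF-8) stays inside the line.
        return self._raw.readline().decode(self._encoding)

    def readlines(self):
        return list(self)

    def __iter__(self):
        return self

    def __next__(self):
        line = self.readline()
        if not line:
            raise StopIteration
        return line


if __name__ == "__main__":
    data = "a\u2028b\nc\n".encode("utf-8")
    for line in NewlineOnlyReader(io.BytesIO(data)):
        print(repr(line))
```

Wrapping instead of subclassing keeps the sketch independent of how the file was opened, as long as it was opened in binary mode.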
If you use Python 3.0 (note that I don't, so I can't test), according to the documentation you can pass an optional newline parameter to open to specify which line separator to use. However, the documentation doesn't mention U+2028 at all (it only mentions \r, \n, and \r\n as line separators), so it's actually a surprise to me that this even occurs (although I can confirm this even with Python 2.6).
I couldn't reproduce that behavior, but here's a naive solution that just merges readline results until they don't end with U+2028.
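A minimal sketch of that merging idea, written for Python 3 and not taken from the original answer: the generator name merge_on_line_separator is hypothetical, and the usage part assumes a codecs stream reader as the source of the over-split pieces.

```python
import codecs
import io

LINE_SEPARATOR = "\u2028"


def merge_on_line_separator(pieces):
    """Yield lines, gluing back pieces that were cut right after U+2028."""
    pending = ""
    for piece in pieces:
        pending += piece
        if not pending.endswith(LINE_SEPARATOR):
            # The piece ended at a real line end, so emit it.
            yield pending
            pending = ""
    if pending:
        # The input stopped right after a U+2028; don't lose that tail.
        yield pending


if __name__ == "__main__":
    raw = io.BytesIO("one\u2028two\u2028three\nfour\n".encode("utf-8"))
    # Assumption: this reader splits on U+2028, as described in the thread.
    reader = codecs.getreader("utf-8")(raw)
    for line in merge_on_line_separator(reader):
        print(repr(line))
```

This only undoes splits on U+2028. A reader that also splits on other characters (U+2029, for example) would need those added to the check.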
The codecs module is doing the RIGHT thing. U+2028 is named "LINE SEPARATOR" with the comment "may be used to represent this semantic unambiguously". So treating it as a line separator is sensible.
Presumably the creator would not have put the U+2028 characters there without good reason ... does the file have u"\n" as well? Why do you want lines not to be split on U+2028?