Python read from file and remove non-ascii characters

Posted 2019-05-20 14:03

I have the following program that reads a file word by word and writes each word to another file, but without the non-ascii characters from the first file.

import unicodedata
import codecs
infile = codecs.open('d.txt','r',encoding='utf-8',errors='ignore')
outfile = codecs.open('d_parsed.txt','w',encoding='utf-8',errors='ignore')


for line in infile.readlines():
    for word in line.split():
        outfile.write(word+" ")
    outfile.write("\n")

infile.close()
outfile.close()

The only problem I am facing is that this code does not write a newline to the second file (d_parsed.txt). Any clues?

3 answers
Lonely孤独者°
#2 · 2019-05-20 14:35

Use codecs to open the csv file with encoding='ascii' and errors='ignore'; the non-ascii characters are then dropped while reading:

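import codecs

# errors='ignore' silently drops every byte that is not valid ASCII
reader = codecs.open("example.csv", 'r', encoding='ascii', errors='ignore')
for reading in reader:
    print(reading)  # print the current line, not the reader object
reader.close()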
做个烂人
#3 · 2019-05-20 14:38

codecs.open() doesn't support universal newlines; for example, it doesn't translate \r\n to \n while reading on Windows.
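
To see the difference for yourself, here is a minimal sketch (Python 3 assumed; the helper name read_lines_both_ways and the sample bytes are made up for illustration). It writes a small file with \r\n line endings and reads it back once with codecs.open() and once with io.open():

```python
import codecs
import io
import os
import tempfile


def read_lines_both_ways(raw=b"first\r\nsecond\r\n"):
    """Return (lines read via codecs.open, lines read via io.open)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        # Write the raw bytes untouched so the file really contains \r\n.
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        with codecs.open(path, "r", encoding="utf-8") as f:
            codecs_lines = list(f)
        with io.open(path, "r", encoding="utf-8") as f:
            io_lines = list(f)
    finally:
        os.remove(path)
    return codecs_lines, io_lines


codecs_lines, io_lines = read_lines_both_ways()
print(codecs_lines)  # line endings come back exactly as stored
print(io_lines)      # \r\n has been translated to \n
```

With io.open() each line ends in a plain \n, while codecs.open() hands back the \r\n unchanged.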

Use io.open() instead:

#!/usr/bin/env python
from __future__ import print_function
import io

with io.open('d.txt','r',encoding='utf-8',errors='ignore') as infile, \
     io.open('d_parsed.txt','w',encoding='ascii',errors='ignore') as outfile:
    for line in infile:
        print(*line.split(), file=outfile)

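btw, if you want to remove non-ascii characters, you should use ascii instead of utf-8 as the output encoding (together with errors='ignore'); writing with utf-8 keeps every character that was read.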
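
The same effect can be checked on a string in memory, without touching any files. A minimal sketch (Python 3 assumed; strip_non_ascii and the sample text are only illustrative names and data):

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    # errors="ignore" skips unencodable characters instead of raising.
    return text.encode("ascii", errors="ignore").decode("ascii")


# Escapes are used so the example does not depend on the source encoding.
sample = "na\u00efve caf\u00e9 \u5b64\u72ec ok"
cleaned = strip_non_ascii(sample)
print(repr(cleaned))
```

Note that the spaces around a removed word stay in place, so the result can contain runs of whitespace.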

If the input encoding is compatible with ascii (such as utf-8) then you could open the file in binary mode and use bytes.translate() to remove non-ascii characters:

#!/usr/bin/env python
nonascii = bytearray(range(0x80, 0x100))
with open('d.txt','rb') as infile, open('d_parsed.txt','wb') as outfile:
    for line in infile: # b'\n'-separated lines (Linux, OSX, Windows)
        outfile.write(line.translate(None, nonascii))

Unlike the first code example, it doesn't normalize whitespace.
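
A small in-memory sketch of the bytes.translate() approach (Python 3 assumed; drop_non_ascii_bytes and the sample data are hypothetical). It works on UTF-8 input because every byte of a multi-byte UTF-8 sequence is 0x80 or above:

```python
# All byte values outside the ASCII range.
NON_ASCII = bytes(range(0x80, 0x100))


def drop_non_ascii_bytes(data):
    """Delete every byte >= 0x80; ASCII bytes pass through unchanged."""
    return data.translate(None, NON_ASCII)


raw = "caf\u00e9  \u5b64\u72ec\tok\r\n".encode("utf-8")
cleaned = drop_non_ascii_bytes(raw)
print(cleaned)

# Splitting and re-joining collapses the whitespace if that is wanted.
normalized = b" ".join(cleaned.split())
print(normalized)
```

The double space, the tab and the \r\n all survive in cleaned; only the split/join step collapses them.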

冷血范
#4 · 2019-05-20 14:39

From the docs for codecs.open:

Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.

I presume you're using Windows, where the newline sequence is actually '\r\n'. A file opened in text mode will do the conversion from \n to \r\n automatically, but that doesn't happen with codecs.open.

Simply write "\r\n" instead of "\n" and it should work fine, at least on Windows.
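
If you would rather not hard-code "\r\n" in every write, one possible alternative is to open the output in text mode with io.open() and pass the newline argument, so that each "\n" you write is translated on the way out. A minimal sketch (Python 3 assumed; write_with_crlf and the sample lines are made up for illustration):

```python
import io
import os
import tempfile


def write_with_crlf(lines):
    """Write lines to a temp file with \\r\\n endings; return the raw bytes."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # newline="\r\n" translates every "\n" written into "\r\n";
        # encoding="ascii" with errors="ignore" drops non-ascii characters.
        with io.open(path, "w", encoding="ascii", errors="ignore",
                     newline="\r\n") as out:
            for line in lines:
                out.write(line + "\n")
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


data = write_with_crlf(["one", "tw\u00f6"])
print(data)
```

Leaving newline at its default (None) instead makes Python use the platform's own line separator, which is "\r\n" on Windows.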
