Python read from file and remove non-ascii characters

I have the following program, which reads a file word by word and writes each word to another file, dropping the non-ASCII characters found in the first file.

import unicodedata
import codecs
infile = codecs.open('d.txt','r',encoding='utf-8',errors='ignore')
outfile = codecs.open('d_parsed.txt','w',encoding='utf-8',errors='ignore')


for line in infile.readlines():
    for word in line.split():
        outfile.write(word+" ")
    outfile.write("\n")

infile.close()
outfile.close()

The only problem I am facing is that this code does not write a new line to the second file (d_parsed.txt). Any clues?

Tags: python encoding character-encoding utf

3 Answers

Lonely孤独者°

Reply #2 · 2019-05-20 14:35

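Use codecs to open the CSV file with the ascii encoding and errors='ignore'; non-ASCII characters are then dropped as the file is read:

import codecs

# Decoding as ASCII with errors='ignore' silently drops non-ASCII bytes.
reader = codecs.open("example.csv", 'r', encoding='ascii', errors='ignore')
for reading in reader:
    print(reading)  # print the current line, not the reader object
reader.close()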


做个烂人

Reply #3 · 2019-05-20 14:38

codecs.open() doesn't support universal newlines; for example, it doesn't translate \r\n to \n while reading on Windows.

Use io.open() instead:

#!/usr/bin/env python
from __future__ import print_function
import io

with io.open('d.txt','r',encoding='utf-8',errors='ignore') as infile, \
     io.open('d_parsed.txt','w',encoding='ascii',errors='ignore') as outfile:
    for line in infile:
        print(*line.split(), file=outfile)

By the way, if you want to remove non-ASCII characters, you should use ascii instead of utf-8 as the encoding of the output file.
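
The same idea can be tried on a string that is already in memory. This is only a minimal sketch: the helper name strip_non_ascii and the sample text are hypothetical, chosen for illustration. Encoding to ASCII with errors='ignore' drops every character that has no ASCII representation.

```python
def strip_non_ascii(text):
    """Return text with every non-ASCII character dropped (hypothetical helper)."""
    # errors='ignore' skips characters the ascii codec cannot encode.
    return text.encode('ascii', errors='ignore').decode('ascii')


sample = u"caf\u00e9 na\u00efve"  # "café naïve"
print(strip_non_ascii(sample))
```

Note that the accented letters are removed outright, not replaced by their unaccented forms.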

If the input encoding is compatible with ascii (such as utf-8) then you could open the file in binary mode and use bytes.translate() to remove non-ascii characters:

#!/usr/bin/env python
nonascii = bytearray(range(0x80, 0x100))
with open('d.txt','rb') as infile, open('d_parsed.txt','wb') as outfile:
    for line in infile: # b'\n'-separated lines (Linux, OSX, Windows)
        outfile.write(line.translate(None, nonascii))

Unlike the first code example, it doesn't normalize whitespace.
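
To make the whitespace difference concrete, here is a small in-memory sketch (Python 3). The helper names strip_bytes and strip_and_normalize and the sample bytes are assumptions made for illustration; the second helper only approximates what splitting and re-joining the words does in the first example.

```python
# Bytes 0x80-0xFF never occur in ASCII text; in UTF-8 they appear only
# inside multi-byte (non-ASCII) characters.
NONASCII = bytes(range(0x80, 0x100))


def strip_bytes(line):
    """Delete non-ASCII bytes and keep whitespace exactly as it was."""
    return line.translate(None, NONASCII)


def strip_and_normalize(line):
    """Delete non-ASCII bytes, then collapse runs of whitespace to one space."""
    return b" ".join(strip_bytes(line).split())


raw = u"h\u00e9llo   w\u00f6rld\n".encode("utf-8")
print(strip_bytes(raw))          # three spaces and the newline survive
print(strip_and_normalize(raw))  # single space, trailing newline removed
```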


冷血范

Reply #4 · 2019-05-20 14:39

From the docs for codecs.open:

Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.

I presume you're using Windows, where the newline sequence is actually '\r\n'. A file opened in text mode will do the conversion from \n to \r\n automatically, but that doesn't happen with codecs.open.

Simply write "\r\n" instead of "\n" and it should work fine, at least on Windows.
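
If hard-coding "\r\n" in every write is undesirable, one alternative (not part of this answer, just a sketch) is to open the output with io.open() and pass newline='\r\n', so that each '\n' written is translated on any platform. The helper name write_lines_crlf and the temporary file are hypothetical.

```python
import io
import os
import tempfile


def write_lines_crlf(path, lines):
    """Write each line followed by '\\r\\n', dropping non-ASCII characters."""
    # newline='\r\n' makes every '\n' written come out as '\r\n'.
    with io.open(path, 'w', encoding='ascii', errors='ignore',
                 newline='\r\n') as outfile:
        for line in lines:
            outfile.write(line + u"\n")


# Demonstrate on a throwaway file and inspect the raw bytes.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
write_lines_crlf(demo_path, [u"first line", u"second line"])
with open(demo_path, 'rb') as infile:
    written = infile.read()
os.remove(demo_path)
print(repr(written))
```

Reading the file back in binary mode shows the line endings that were actually stored.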

