I have the following program, which reads a file word by word and writes each word to another file, without the non-ASCII characters from the first file:
import unicodedata
import codecs
infile = codecs.open('d.txt','r',encoding='utf-8',errors='ignore')
outfile = codecs.open('d_parsed.txt','w',encoding='utf-8',errors='ignore')
for line in infile.readlines():
    for word in line.split():
        outfile.write(word+" ")
    outfile.write("\n")
infile.close()
outfile.close()
The only problem is that this code does not write a newline to the second file (d_parsed.txt). Any clues?
Use codecs to open the CSV file, and then you can avoid the non-ASCII characters.
codecs.open() doesn't support universal newlines, e.g., it doesn't translate \r\n to \n while reading on Windows. Use io.open() instead.

By the way, if you want to remove non-ASCII characters, you should use ascii instead of utf-8.

If the input encoding is compatible with ASCII (such as utf-8), then you could open the file in binary mode and use bytes.translate() to remove non-ASCII characters. Unlike the first code example, it doesn't normalize whitespace.
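For reference, here is a minimal sketch of the two approaches described above, assuming Python 3 and an input file in an ASCII-compatible encoding such as UTF-8. The helper names strip_non_ascii_text and strip_non_ascii_bytes are made up for illustration; they are not part of any library.

```python
import io

# Bytes 0x80-0xFF: every byte value outside the 7-bit ASCII range.
NON_ASCII_BYTES = bytes(range(0x80, 0x100))


def strip_non_ascii_text(src, dst):
    """Text-mode variant (hypothetical helper): normalizes whitespace."""
    # Assumption: the input is ASCII-compatible, so decoding it as ascii
    # with errors='ignore' simply drops every non-ASCII byte.
    with io.open(src, "r", encoding="ascii", errors="ignore") as infile, \
            io.open(dst, "w", encoding="ascii") as outfile:
        for line in infile:
            # split() + join() collapses runs of whitespace to single spaces;
            # text mode translates "\n" to the platform newline on writing.
            outfile.write(" ".join(line.split()) + "\n")


def strip_non_ascii_bytes(src, dst):
    """Binary variant (hypothetical helper): leaves whitespace untouched."""
    with io.open(src, "rb") as infile, io.open(dst, "wb") as outfile:
        # translate(None, delete) removes the listed bytes and maps nothing.
        outfile.write(infile.read().translate(None, NON_ASCII_BYTES))
```

Calling strip_non_ascii_text('d.txt', 'd_parsed.txt') would roughly mirror the question's program. The binary version keeps the original line endings and spacing exactly as they are, and it reads the whole file into memory, which is fine for small files only.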
From the docs for codecs.open: files are always opened in binary mode, so no automatic conversion of '\n' is done on reading and writing.

I presume you're using Windows, where the newline sequence is actually '\r\n'. A file opened in text mode will do the conversion from \n to \r\n automatically, but that doesn't happen with codecs.open.

Simply write "\r\n" instead of "\n" and it should work fine, at least on Windows.
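As an alternative to hard-coding "\r\n", io.open() accepts a newline argument that controls how "\n" is translated on writing. A small sketch, assuming Python 3; write_lines is a hypothetical helper name, not something from the question:

```python
import io


def write_lines(path, lines, newline=None):
    """Write each item of `lines` followed by a line terminator.

    newline=None   -> "\n" is translated to the platform default (os.linesep)
    newline="\r\n" -> "\n" is always written as "\r\n", on any platform
    newline="\n"   -> "\n" is written unchanged
    """
    with io.open(path, "w", encoding="ascii", newline=newline) as outfile:
        for line in lines:
            outfile.write(line + "\n")
```

With newline="\r\n" the output gets Windows line endings regardless of where the script runs, so the code does not need a platform-specific string in every write() call.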