I'm trying to take a Unicode file stream, which contains odd characters, and wrap it with a stream reader that will convert it to ASCII, ignoring or replacing all characters that can't be encoded.
My stream looks like:
"EventId","Rate","Attribute1","Attribute2","(。・ω・。)ノ"
...
My attempt to alter the stream on the fly looks like this:
import chardet, io, codecs

with open(self.csv_path, 'rb') as rawdata:
    detected = chardet.detect(rawdata.read(1000))
detectedEncoding = detected['encoding']

with io.open(self.csv_path, 'r', encoding=detectedEncoding) as csv_file:
    csv_ascii_stream = codecs.getreader('ascii')(csv_file, errors='ignore')
    log(csv_ascii_stream.read())
The result on the log line is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 36-40: ordinal not in range(128)
even though I explicitly constructed the StreamReader with errors='ignore'.
I would like the resulting stream (when read) to come out like this:
"EventId","Rate","Attribute1","Attribute2","(?????)?"
...
or alternatively, "EventId","Rate","Attribute1","Attribute2","()" (using 'ignore' instead of 'replace').
Why is the Exception happening anyway?
I've seen plenty of problems/solutions for decoding strings, but my challenge is to change the stream as it's being read (using .next()), because the file is potentially too large to be loaded into memory all at once using .read().
You're mixing up the encode and decode sides.
For decoding, you're doing fine. You open the file as binary data, chardet the first 1K, then reopen it in text mode using the detected encoding. But then you're trying to further decode that already-decoded data as ASCII, by using codecs.getreader. That function returns a StreamReader, which decodes data from a stream. That isn't going to work. You need to encode that data to ASCII.
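To make the direction of the error concrete, here's a minimal Python 2 sketch (mine, not part of the question's code) using the same header text: encoding it down to ASCII honors the error handler, while asking the 'ascii' codec to decode already-decoded unicode forces an implicit strict re-encode first, which is where the UnicodeEncodeError is raised before errors='ignore' is ever consulted:

header = u'"Attribute2","(\u3002\u30fb\u03c9\u30fb\u3002)\u30ce"'   # already-decoded text

print(header.encode('ascii', 'ignore'))    # encoding works: '"Attribute2","()"'
print(header.encode('ascii', 'replace'))   # or: '"Attribute2","(?????)?"'

try:
    header.decode('ascii', 'ignore')       # wrong direction: Python 2 re-encodes strictly first
except UnicodeEncodeError as e:
    print(e)                               # the same error the question reports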
But it's not clear why you're using a codecs stream decoder or encoder in the first place, when all you want to do is encode a single chunk of text in one go so you can log it. Why not just call the encode method?
log(csv_file.read().encode('ascii', 'ignore'))
If you want something that you can use as a lazy iterable of lines, you could build something fully general, but it's a lot simpler to just do something like the UTF8Recoder example in the csv docs:
class AsciiRecoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("ascii", "ignore")
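A hypothetical usage sketch (the variable names are mine): give the recoder the raw byte stream, so its internal StreamReader does the decoding, and log each re-encoded line lazily:

import codecs, io

with io.open(self.csv_path, 'rb') as raw:               # bytes in, ASCII lines out
    for ascii_line in AsciiRecoder(raw, detectedEncoding):
        log(ascii_line)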
Or, even more simply:
with io.open(self.csv_path, 'r', encoding=detectedEncoding) as csv_file:
    csv_ascii_stream = (line.encode('ascii', 'ignore') for line in csv_file)
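Assuming the end goal is to actually parse the CSV (inferred from the file name, not stated in the question), that generator can be fed straight to csv.reader, which in Python 2 wants byte strings, so the file is never loaded all at once:

import csv, io

with io.open(self.csv_path, 'r', encoding=detectedEncoding) as csv_file:
    csv_ascii_stream = (line.encode('ascii', 'ignore') for line in csv_file)
    for row in csv.reader(csv_ascii_stream):
        log(row)                          # one list of ASCII fields per row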
I'm a little late to the party with this, but here's an alternate solution, using codecs.StreamRecoder:
from codecs import getencoder, getdecoder, getreader, getwriter, StreamRecoder

with io.open(self.csv_path, 'rb') as f:
    csv_ascii_stream = StreamRecoder(f,
                                     getencoder('ascii'),
                                     getdecoder(detectedEncoding),
                                     getreader(detectedEncoding),
                                     getwriter('ascii'),
                                     errors='ignore')
    print(csv_ascii_stream.read())
I guess you may want to use this if you need the flexibility to be able to call read()/readlines()/seek()/tell() etc. on the stream that gets returned. If you just need to iterate over the stream, the generator expression abarnert provided is a bit more concise.
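For completeness, a sketch (my loop, not from the original answer) of consuming the recoded stream lazily, line by line, rather than pulling the whole file in with read(), since that was the original concern:

from codecs import getencoder, getdecoder, getreader, getwriter, StreamRecoder
import io

with io.open(self.csv_path, 'rb') as f:
    csv_ascii_stream = StreamRecoder(f,
                                     getencoder('ascii'),
                                     getdecoder(detectedEncoding),
                                     getreader(detectedEncoding),
                                     getwriter('ascii'),
                                     errors='ignore')
    for line in iter(csv_ascii_stream.readline, ''):    # readline() returns '' at EOF
        log(line)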