Given how convoluted my solution seems to be, I am probably doing it all wrong.
Basically, I am trying to replace strings on the fly in a text stream (e.g. open('filename', 'r') or io.StringIO(text)). The context is that I'm trying to let pandas.read_csv() handle "Infinity" as "inf" instead of choking on it.
I do not want to slurp the whole file into memory (it can be big, and even though the resulting DataFrame will live in memory, there is no need to hold the whole text file too). Efficiency is a concern, so I'd like to keep using read(size) as the main way to get text in (no readline, which is quite a bit slower). The difficulty comes from the cases where read() might return a block of text that ends in the middle of one of the strings we'd like to replace.
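To make the boundary problem concrete, here is a minimal sketch of what goes wrong if each block is replaced independently. The helper name naive_replace_chunks is made up for this illustration, and the chunk size is chosen on purpose so that the split lands inside "Infinity".

```python
import io

def naive_replace_chunks(stream, old, new, size):
    """Replace `old` with `new` in each chunk independently (deliberately naive)."""
    pieces = []
    while True:
        chunk = stream.read(size)
        if not chunk:
            break
        pieces.append(chunk.replace(old, new))
    return ''.join(pieces)

text = "e,+Infinity\n"

# With size=6 the chunks are 'e,+Inf' and 'inity\n'; neither one
# contains 'Infinity', so the replacement is silently missed.
result = naive_replace_chunks(io.StringIO(text), 'Infinity', 'inf', 6)
print(repr(result))

# With a chunk large enough to hold the whole line, the match is found.
result_big = naive_replace_chunks(io.StringIO(text), 'Infinity', 'inf', 100)
print(repr(result_big))
```

Any chunked approach therefore has to hold back some text at the end of each block until it is known that no match can start there.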
Anyway, below is what I've got so far. It handles the conditions I've thrown at it (lines longer than size, search strings straddling the boundary of a read block), but I'm wondering if there is something simpler.
Oh, BTW, I don't handle anything other than calls to read().
import io

class ReplaceIOFile(io.TextIOBase):
    def __init__(self, iobuffer, old_list, new_list):
        self.iobuffer = iobuffer
        self.old_list = old_list
        self.new_list = new_list
        self.buf0 = ''  # text already processed, waiting to be returned
        self.buf1 = ''  # raw text after the last newline, not yet processed
        self.sub_has_more = True  # False once the underlying stream is exhausted

    def read(self, size=None):
        if size is None:
            size = 2**16
        while len(self.buf0) < size and self.sub_has_more:
            eol = 0
            # keep reading until buf1 holds at least one complete line (or EOF)
            while eol <= 0:
                txt = self.iobuffer.read(size)
                self.buf1 += txt
                if len(txt) < size:
                    # short read: treat as end of stream and flush everything
                    self.sub_has_more = False
                    eol = len(self.buf1) + 1
                else:
                    eol = self.buf1.rfind('\n') + 1
            # only complete lines get processed, so no match is cut in half
            txt, self.buf1 = self.buf1[:eol], self.buf1[eol:]
            for old, new in zip(self.old_list, self.new_list):
                txt = txt.replace(old, new)
            self.buf0 += txt
        val, self.buf0 = self.buf0[:size], self.buf0[size:]
        return val
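For comparison, the buffering idea used above (cut only after a newline, so that no search string is split) can be sketched on its own as a generator. This is only an illustration of the logic, not a drop-in replacement: the name iter_replaced and the dict argument are made up here, it assumes that no search string contains a newline, and it does not offer read(size), so it cannot be handed to pandas.read_csv() as is.

```python
import io

def iter_replaced(stream, replacements, size=2**16):
    """Yield text from `stream` with replacements applied.

    Pieces are cut only right after a newline. Assumes that none of the
    search strings contains a newline, so a match can never straddle
    two yielded pieces.
    """
    pending = ''  # raw text after the last newline seen so far
    while True:
        chunk = stream.read(size)
        if not chunk:
            break
        pending += chunk
        cut = pending.rfind('\n') + 1
        if cut == 0:
            continue  # no complete line yet; keep reading
        complete, pending = pending[:cut], pending[cut:]
        for old, new in replacements.items():
            complete = complete.replace(old, new)
        yield complete
    if pending:
        # trailing text without a final newline
        for old, new in replacements.items():
            pending = pending.replace(old, new)
        yield pending

sample = "name,val\na,1.0\ne,+Infinity\nf,-inf"
joined = ''.join(iter_replaced(io.StringIO(sample), {'Infinity': 'inf'}, size=4))
print(joined)
```

The yielded pieces vary in length (at least one full line each, except possibly the last), which is why the class above needs a second buffer to serve exact read(size) requests.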
Example:
text = """\
name,val
a,1.0
b,2.0
e,+Infinity
f,-inf
"""

size = 4  # or whatever -- I tried 1,2,4,10,100,2**16
with ReplaceIOFile(io.StringIO(text), ['Infinity'], ['inf']) as f:
    while True:
        buf = f.read(size)
        print(buf, end='')
        if len(buf) < size:
            break
Output:
name,val
a,1.0
b,2.0
e,+inf
f,-inf
So for my application:
import numpy as np
import pandas as pd

# x = pd.read_csv(io.StringIO(text), dtype=dict(val=np.float64))  ## crashes
x = pd.read_csv(ReplaceIOFile(io.StringIO(text), ['Infinity'], ['inf']), dtype=dict(val=np.float64))
Out:
  name       val
0    a  1.000000
1    b  2.000000
2    e       inf
3    f      -inf