How to wrap a Python text stream to replace strings

Posted 2019-08-04 08:43

Question:

Given how convoluted my solution seems to be, I am probably doing it all wrong.

Basically, I am trying to replace strings on the fly in a text stream (e.g. open('filename', 'r') or io.StringIO(text)). The context is that I'm trying to let pandas.read_csv() handle "Infinity" as "inf" instead of choking on it.

I do not want to slurp the whole file into memory (it can be big, and even though the resulting DataFrame will live in memory, there is no need to hold the whole text file too). Efficiency is a concern, so I'd like to keep using read(size) as the main way to get text in (not readline(), which is considerably slower). The difficulty comes from the cases where read() returns a block of text that ends in the middle of one of the strings we'd like to replace.
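To make the boundary problem concrete, here is a minimal sketch. The helper naive_replace_chunks is hypothetical and deliberately naive (it is not part of the solution below): it replaces within each block independently, so a search string that straddles two blocks is missed.

```python
import io


def naive_replace_chunks(stream, old, new, size):
    """Replace `old` with `new` in each block independently (deliberately naive)."""
    parts = []
    while True:
        block = stream.read(size)
        if not block:  # empty string means end of stream
            break
        parts.append(block.replace(old, new))
    return "".join(parts)


text = "e,+Infinity\n"

# With size=8 the blocks are "e,+Infin" and "ity\n": neither contains the
# full search string, so nothing is replaced.
split_result = naive_replace_chunks(io.StringIO(text), "Infinity", "inf", size=8)

# With a block large enough to hold the whole text, the replacement works.
whole_result = naive_replace_chunks(io.StringIO(text), "Infinity", "inf", size=64)
```

Here split_result is still the unmodified text, while whole_result is "e,+inf\n", so the outcome depends on where the block boundaries happen to fall.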

Anyway, below is what I have so far. It handles the conditions I've thrown at it (lines longer than size, search strings at the boundary of a read block), but I'm wondering if there is something simpler.

Oh, BTW, I only handle calls to read(); no other stream methods are supported.

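import io


class ReplaceIOFile(io.TextIOBase):
    """Wrap a text stream and replace strings on the fly; only read() is supported."""

    def __init__(self, iobuffer, old_list, new_list):
        self.iobuffer = iobuffer
        self.old_list = old_list
        self.new_list = new_list
        self.buf0 = ''  # text already processed, waiting to be returned
        self.buf1 = ''  # raw text after the last complete line, not yet processed
        self.sub_has_more = True  # False once the wrapped stream is exhausted

    def read(self, size=None):
        # Note: unlike the standard read(), size=None here means "up to 2**16
        # characters", not "everything".
        if size is None:
            size = 2**16
        while len(self.buf0) < size and self.sub_has_more:
            eol = 0
            # Keep reading until buf1 holds at least one complete line (or EOF).
            # Replacing only in complete lines avoids splitting a search string,
            # assuming the search strings contain no newline.
            while eol <= 0:
                txt = self.iobuffer.read(size)
                self.buf1 += txt
                if len(txt) < size:
                    # Short read: treat as end of stream and process what is left.
                    self.sub_has_more = False
                    eol = len(self.buf1) + 1
                else:
                    eol = self.buf1.rfind('\n') + 1
            # Process the complete lines; keep the partial last line for later.
            txt, self.buf1 = self.buf1[:eol], self.buf1[eol:]
            for old, new in zip(self.old_list, self.new_list):
                txt = txt.replace(old, new)
            self.buf0 += txt
        val, self.buf0 = self.buf0[:size], self.buf0[size:]
        return val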
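The key step in read() above is cutting the raw buffer at its last newline, so that replacements only ever run on complete lines. Isolated as a small sketch (split_at_last_newline is a hypothetical helper name used only for illustration):

```python
def split_at_last_newline(buf):
    """Split `buf` into (complete lines, trailing partial line)."""
    eol = buf.rfind("\n") + 1  # 0 when buf contains no newline at all
    return buf[:eol], buf[eol:]


# The partial last line is held back until more text arrives.
complete, partial = split_at_last_newline("a,1.0\ne,+Infi")

# With no newline in the buffer, nothing is ready to be processed yet.
nothing_ready, pending = split_at_last_newline("e,+Infi")
```

In the first call, complete is "a,1.0\n" and partial is "e,+Infi"; in the second, nothing_ready is the empty string, which is why the inner loop keeps reading when a line is longer than size.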

Example:

text = """\
name,val
a,1.0
b,2.0
e,+Infinity
f,-inf
"""

size = 4  # or whatever -- I tried 1,2,4,10,100,2**16
with ReplaceIOFile(io.StringIO(text), ['Infinity'], ['inf']) as f:
    while True:
        buf = f.read(size)
        print(buf, end='')
        if len(buf) < size:
            break

Output:

name,val
a,1.0
b,2.0
e,+inf
f,-inf

So for my application:

import numpy as np
import pandas as pd

# x = pd.read_csv(io.StringIO(text), dtype=dict(val=np.float64))  ## crashes
x = pd.read_csv(ReplaceIOFile(io.StringIO(text), ['Infinity'], ['inf']), dtype=dict(val=np.float64))

Out:

  name       val
0    a  1.000000
1    b  2.000000
2    e       inf
3    f      -inf
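For comparison, here is a sketch of a different way to handle block boundaries that does not depend on newlines: after replacing every complete match in the buffer, hold back only the last len(old) - 1 characters, since those are the only ones that could begin a match continuing into the next block. This is an alternative technique, not the class above; replace_in_chunks and read_blocks are hypothetical names, it handles a single old/new pair, and it assumes old is non-empty.

```python
import io


def replace_in_chunks(chunks, old, new):
    """Yield text from `chunks` with every `old` replaced by `new`.

    Assumes `old` is a non-empty string.
    """
    keep = len(old) - 1  # longest proper prefix of `old` a block can end with
    buf = ""
    for chunk in chunks:
        buf += chunk
        out = []
        pos = 0
        while True:
            i = buf.find(old, pos)
            if i < 0:
                break
            out.append(buf[pos:i])
            out.append(new)
            pos = i + len(old)
        # No complete match remains in buf[pos:]. Hold back the last `keep`
        # characters: they may start a match that continues in the next block.
        cut = max(pos, len(buf) - keep)
        out.append(buf[pos:cut])
        buf = buf[cut:]
        yield "".join(out)
    yield buf  # the held-back tail contains no complete match


def read_blocks(stream, size):
    """Yield successive read(size) blocks until the stream is exhausted."""
    while True:
        block = stream.read(size)
        if not block:
            return
        yield block


text = "name,val\na,1.0\ne,+Infinity\nf,-inf\n"
for size in (1, 2, 4, 10, 100):
    blocks = read_blocks(io.StringIO(text), size)
    result = "".join(replace_in_chunks(blocks, "Infinity", "inf"))
    assert result == text.replace("Infinity", "inf")
```

This yields chunks rather than providing read(), so it would still need a thin file-like wrapper before it could be handed to pandas.read_csv().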