Python pandas: read file skipping commented lines


Question:

I often deal with ASCII tables containing a few columns (normally fewer than 10) and up to tens of millions of lines. They look like

176.792 -2.30523 0.430772 32016 1 1 2 
177.042 -1.87729 0.430562 32016 1 1 1
177.047 -1.54957 0.431853 31136 1 1 1
...
177.403 -0.657246 0.432905 31152 1 1 1

I have a number of Python scripts that read, manipulate and save files. I have always used numpy.loadtxt and numpy.savetxt to do it. But numpy.loadtxt takes at least 5-6 GB of RAM to read a 1 GB ASCII file.

Yesterday I discovered pandas, which solved almost all my problems: pandas.read_table together with numpy.savetxt improved the execution speed of two of my scripts by a factor of 3 or 4, while being very memory efficient.
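
For reference, that combination looks roughly like this (a minimal sketch; the file names and the format string are placeholders of my own):

import numpy as np
import pandas as pd

# read with pandas (fast and memory-efficient) ...
df = pd.read_table('table.txt', header=None, sep=r'\s+')
# ... manipulate, then write back with numpy
np.savetxt('table_out.txt', df.values, fmt='%g')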

All was good until I tried to read a file that contains a few commented lines at the beginning. The docstring (v=0.10.1.dev_f73128e) tells me that line commenting is not supported yet, but will probably come. I think this would be great: I really like the exclusion of comment lines in numpy.loadtxt. Is there any idea on when this will become available? It would also be nice to have the possibility to skip those lines (the docs state that they would be returned as empty).

Not knowing how many comment lines I have in my files (I process thousands of them, coming from different people), for now I open each file and count the number of lines at the beginning that start with a comment:

import re

def n_comments(fname, comment):
    with open(fname, 'r') as f:
        n_lines = 0
        pattern = re.compile(r"^\s*{0}".format(comment))
        for l in f:
            if pattern.search(l) is None:
                break
            else:
                n_lines += 1
    return n_lines

and then

pandas.read_table(fname, skiprows=n_comments(fname, '#'), header=None, sep=r'\s+')
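
An equivalent but more compact way to count the leading comment lines is itertools.takewhile (a sketch; the helper name n_comment_lines is my own):

import itertools
import pandas

def n_comment_lines(fname, comment='#'):
    # count leading lines whose first non-whitespace character is the comment char
    with open(fname) as f:
        return sum(1 for _ in itertools.takewhile(
            lambda l: l.lstrip().startswith(comment), f))

pandas.read_table(fname, skiprows=n_comment_lines(fname), header=None, sep=r'\s+')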

Is there any better way (maybe within pandas) to do it?

Finally, before posting, I looked a bit at the code in pandas.io.parsers.py to understand how pandas.read_table works under the hood, but I got lost. Can anyone point me to the places that implement the reading of the files?

Thanks

EDIT2: I thought I could get some improvement by getting rid of some of the ifs in @ThorstenKranz's second implementation of FileWrapper, but got almost no improvement:

class FileWrapper(file):
    def __init__(self, comment_literal, *args):
        super(FileWrapper, self).__init__(*args)
        self._comment_literal = comment_literal
        # start in the comment-skipping state; switch once past the header
        self._next = self._next_comment

    def next(self):
        return self._next()

    def _next_comment(self):
        # skip leading comment lines, then switch permanently to the fast path
        while True:
            line = super(FileWrapper, self).next()
            if not line.strip().startswith(self._comment_literal):
                self._next = self._next_no_comment
                return line

    def _next_no_comment(self):
        return super(FileWrapper, self).next()

Answer 1:

read_csv and read_table have a comment option that skips bytes from a comment character until the end of the line. If an entire line needs to be skipped, this isn't quite right, because the parser will think it has seen a line with no fields in it, then eventually see a valid data line and get confused.

I'd suggest using your workaround to manually determine the number of rows to skip. It would be nice to have an option that automatically skips lines when the entire line is a comment:

https://github.com/pydata/pandas/issues/2685

Implementing this well would require dipping into the C tokenizer code. It's not as bad as it might sound.
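
(For what it's worth, later pandas versions did implement this: with comment='#', lines that begin with the comment character are ignored altogether, so no manual counting is needed there. A minimal sketch, assuming such a version and a placeholder file name:)

import pandas as pd

# in pandas versions that resolved issue #2685, full-comment lines are
# skipped entirely instead of being parsed as empty records
df = pd.read_table('table.txt', comment='#', header=None, sep=r'\s+')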



Answer 2:

I found a compact solution by creating a class that inherits from file:

import pandas as pd

class FileWrapper(file):
    def __init__(self, comment_literal, *args):
        super(FileWrapper, self).__init__(*args)
        self._comment_literal = comment_literal

    def next(self):
        # return the next non-comment line
        while True:
            line = super(FileWrapper, self).next()
            if not line.startswith(self._comment_literal):
                return line

df = pd.read_table(FileWrapper("#", "14276661.txt", "r"), delimiter=" ", header=None)

At the moment, pandas (0.8.1) only uses the .next() method to iterate over file-like objects. We can override this method and only return those lines that do not start with the dedicated comment literal, in my example "#".

For input file:

176.792 -2.30523 0.430772 32016 1 1 2 
# 177.042 -1.87729 0.430562 32016 1 1 1
177.047 -1.54957 0.431853 31136 1 1 1
177.403 -0.657246 0.432905 31152 1 1 1

we get

>>> df
       X.1       X.2       X.3    X.4  X.5  X.6  X.7
0  176.792 -2.305230  0.430772  32016    1    1    2
1  177.047 -1.549570  0.431853  31136    1    1    1
2  177.403 -0.657246  0.432905  31152    1    1    1

and for

176.792 -2.30523 0.430772 32016 1 1 2 
177.042 -1.87729 0.430562 32016 1 1 1
177.047 -1.54957 0.431853 31136 1 1 1
177.403 -0.657246 0.432905 31152 1 1 1

we get

>>> df
       X.1       X.2       X.3    X.4  X.5  X.6  X.7
0  176.792 -2.305230  0.430772  32016    1    1    2
1  177.042 -1.877290  0.430562  32016    1    1    1
2  177.047 -1.549570  0.431853  31136    1    1    1
3  177.403 -0.657246  0.432905  31152    1    1    1

Instead of inheritance you could also use delegation; it is a matter of taste.
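
A delegation-based variant might look like this (a minimal sketch; the wrapper name CommentSkipper is my own, and as above only .next() is needed):

import pandas as pd

class CommentSkipper(object):
    def __init__(self, f, comment_literal="#"):
        self._f = f                      # the wrapped, already-open file
        self._comment_literal = comment_literal

    def __iter__(self):
        return self

    def next(self):
        # delegate to the wrapped file; StopIteration propagates at EOF
        while True:
            line = self._f.next()
            if not line.startswith(self._comment_literal):
                return line

df = pd.read_table(CommentSkipper(open("14276661.txt"), "#"),
                   delimiter=" ", header=None)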

EDIT I tried many other ways to improve the performance. It's a tough job, though. I tried:

  • threading: read the file ahead in one thread with low-level IO operations and large chunks, split it into lines, enqueue them, and only fetch from the queue in next()
  • the same with multiprocessing
  • a similar multithreaded approach, but using readlines(size_hint)
  • mmap for reading the file

Surprisingly, the first three approaches were slower, so they brought no benefit. Using mmap, however, improved the performance significantly. Here is the code:

import mmap
import time

import pandas as pd

class FileWrapper(file):
    def __init__(self, comment_literal, *args):
        super(FileWrapper, self).__init__(*args)
        self._comment_literal = comment_literal
        # only check for comments until the first data line has been seen
        self._in_comment = True
        self._prepare()

    def __iter__(self):
        return self

    def next(self):
        if self._in_comment:
            while True:
                line = self._get_next_line()
                if line == "":
                    raise StopIteration()
                if not line[0] == self._comment_literal:
                    self._in_comment = False
                    return line
        line = self._get_next_line()
        if line == "":
            raise StopIteration()
        return line

    def _get_next_line(self):
        return super(FileWrapper, self).next()

    def _prepare(self):
        pass

class MmapWrapper(file):
    def __init__(self, fd, comment_literal = "#"):
        # map the whole file read-only; readline() then operates on the mapping
        self._mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
        self._comment_literal = comment_literal
        self._in_comment = True

    def __iter__(self):
        return self

    def next(self):
        if self._in_comment:
            while True:
                line = self._mm.readline()
                if line == "":
                    raise StopIteration()
                if not line[0] == self._comment_literal:
                    self._in_comment = False
                    return line
        line = self._mm.readline()
        if line == "":
            raise StopIteration()
        return line

if __name__ == "__main__":
    # time ten repeated reads of the same file with each wrapper
    t0 = time.time()
    for i in range(10):    
        with open("1gram-d_1.txt", "r+b") as f:
            df1 = pd.read_table(MmapWrapper(f.fileno()), delimiter="\t", header=None)
    print "mmap:", time.time()-t0

    t0 = time.time()    
    for i in range(10):    
        df2 = pd.read_table(FileWrapper("#", "1gram-d_1.txt", "r"), delimiter="\t", header=None)
    print "Unbuffered:", time.time()-t0

    print (df1==df2).mean()

gives as output

mmap: 35.3251504898
Unbuffered: 41.3274121284
X.1    1
X.2    1
X.3    1
X.4    1

I also implemented comment-checking only until the first non-comment line is found. This matches your solution and improves the performance further.

There are some restrictions for mmap, though. If the files are huge, be sure to have enough RAM. And if you're working on a 32-bit OS, you won't be able to mmap files larger than 2 GB.



Tags: python io pandas