I often deal with ASCII tables containing a few columns (normally fewer than 10) and up to tens of millions of lines. They look like
176.792 -2.30523 0.430772 32016 1 1 2
177.042 -1.87729 0.430562 32016 1 1 1
177.047 -1.54957 0.431853 31136 1 1 1
...
177.403 -0.657246 0.432905 31152 1 1 1
I have a number of Python scripts that read, manipulate and save such files. I have always used numpy.loadtxt and numpy.savetxt to do it. But numpy.loadtxt takes at least 5-6 GB of RAM to read a 1 GB ASCII file.
Yesterday I discovered pandas, which solved almost all of my problems: pandas.read_table together with numpy.savetxt improved the execution speed of two of my scripts by a factor of 3 or 4, while being very memory efficient.
All was good until I tried to read in a file that contains a few commented lines at the beginning. The docstring (v=0.10.1.dev_f73128e) tells me that line commenting is not supported, but that it will probably come. I think this would be great: I really like the exclusion of line comments in numpy.loadtxt.
Is there any idea of when this will become available? It would also be nice to have the possibility to skip those lines (the doc states that they will be returned as empty).
Not knowing how many comment lines I have in my files (I process thousands of them coming from different people), for now I open the file and count the number of lines at the beginning that start with a comment:
import re

def n_comments(fname, comment):
    # Count the consecutive comment lines at the top of the file.
    with open(fname, 'r') as f:
        n_lines = 0
        pattern = re.compile(r"^\s*{0}".format(comment))
        for l in f:
            if pattern.search(l) is None:
                break
            else:
                n_lines += 1
    return n_lines
and then
pandas.read_table(fname, skiprows=n_comments(fname, '#'), header=None, sep='\s')
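The same count can be expressed more compactly with itertools.takewhile; a sketch of an equivalent helper (the takewhile variation is mine, not part of the original workaround):

import itertools

def n_comments(fname, comment='#'):
    # Count the consecutive comment lines at the top of the file.
    with open(fname, 'r') as f:
        return sum(1 for _ in itertools.takewhile(
            lambda l: l.lstrip().startswith(comment), f))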
Is there any better way (maybe within pandas) to do it?
Finally, before posting, I looked a bit at the code in pandas.io.parsers.py to understand how pandas.read_table works under the hood, but I got lost. Can anyone point me to the places that implement the reading of the files?
Thanks
EDIT2: I thought I could get some improvement by getting rid of some of the ifs in @ThorstenKranz's second implementation of FileWrapper, but got almost no improvement:
class FileWrapper(file):
    def __init__(self, comment_literal, *args):
        super(FileWrapper, self).__init__(*args)
        self._comment_literal = comment_literal
        self._next = self._next_comment
    def next(self):
        return self._next()
    def _next_comment(self):
        # Skip leading comment lines, then switch to the fast path.
        while True:
            line = super(FileWrapper, self).next()
            if not line.strip()[0] == self._comment_literal:
                self._next = self._next_no_comment
                return line
    def _next_no_comment(self):
        return super(FileWrapper, self).next()
read_csv and read_table have a comment option that will skip bytes starting from a comment character until the end of a line. If an entire line needs to be skipped, this isn't quite right because the parser will think that it's seen a line with no fields in it, then eventually see a valid data line and get confused.
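For a comment at the end of a data line, the option works as just described; a minimal sketch (the sample data and separator are illustrative, not from the original post):

import pandas as pd
from StringIO import StringIO  # Python 2, matching the rest of this thread

data = "176.792 -2.30523 0.430772  # trailing comment\n177.042 -1.87729 0.430562\n"
# Bytes from '#' to the end of the line are dropped before the fields are parsed.
df = pd.read_table(StringIO(data), sep='\s+', header=None, comment='#')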
I'd suggest using your workaround to determine the number of rows to skip manually in the file. It would be nice to have an option that enables automatically skipping lines when the entire line is a comment:
https://github.com/pydata/pandas/issues/2685
Implementing this well would require dipping into the C tokenizer code. It's not as bad as it might sound.
I found a compact solution by creating a class inheriting from file:
import pandas as pd

class FileWrapper(file):
    def __init__(self, comment_literal, *args):
        super(FileWrapper, self).__init__(*args)
        self._comment_literal = comment_literal
    def next(self):
        # Return the next line that does not start with the comment literal.
        while True:
            line = super(FileWrapper, self).next()
            if not line.startswith(self._comment_literal):
                return line

df = pd.read_table(FileWrapper("#", "14276661.txt", "r"), delimiter=" ", header=None)
At the moment, pandas (0.8.1) only uses the .next() method to iterate over file-like objects. We can override this method and return only those lines that do not start with the dedicated comment literal, in my example "#".
For the input file
176.792 -2.30523 0.430772 32016 1 1 2
# 177.042 -1.87729 0.430562 32016 1 1 1
177.047 -1.54957 0.431853 31136 1 1 1
177.403 -0.657246 0.432905 31152 1 1 1
we get
>>> df
X.1 X.2 X.3 X.4 X.5 X.6 X.7
0 176.792 -2.305230 0.430772 32016 1 1 2
1 177.047 -1.549570 0.431853 31136 1 1 1
2 177.403 -0.657246 0.432905 31152 1 1 1
and for
176.792 -2.30523 0.430772 32016 1 1 2
177.042 -1.87729 0.430562 32016 1 1 1
177.047 -1.54957 0.431853 31136 1 1 1
177.403 -0.657246 0.432905 31152 1 1 1
we get
>>> df
X.1 X.2 X.3 X.4 X.5 X.6 X.7
0 176.792 -2.305230 0.430772 32016 1 1 2
1 177.042 -1.877290 0.430562 32016 1 1 1
2 177.047 -1.549570 0.431853 31136 1 1 1
3 177.403 -0.657246 0.432905 31152 1 1 1
Instead of inheritance you could also use delegation; it's a matter of taste.
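A minimal sketch of the delegation variant (the wrapper name and details are mine; it only needs to provide the next() method pandas calls):

class CommentSkipper(object):
    # Delegates to a wrapped file-like object instead of inheriting from file.
    def __init__(self, f, comment_literal="#"):
        self._f = f
        self._comment_literal = comment_literal
    def __iter__(self):
        return self
    def next(self):
        while True:
            line = self._f.next()
            if not line.startswith(self._comment_literal):
                return line

df = pd.read_table(CommentSkipper(open("14276661.txt", "r")), delimiter=" ", header=None)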
EDIT
I tried many other ways to improve the performance. It's a tough job, though. I tried:
- threading: read the file ahead in one thread with low-level I/O operations and large chunks, split it into lines, enqueue these, and only get from the queue on next()
- the same with multiprocessing
- a similar, multithreaded approach, but using readlines(size_hint)
- mmap for reading from the file
The first three approaches surprisingly were slower, so no benefit. Using a mmap significantly improved the performance. Here is the code:
import mmap
import time

class FileWrapper(file):
    def __init__(self, comment_literal, *args):
        super(FileWrapper, self).__init__(*args)
        self._comment_literal = comment_literal
        self._in_comment = True
        self._prepare()
    def __iter__(self):
        return self
    def next(self):
        if self._in_comment:
            # Skip comment lines until the first data line is found.
            while True:
                line = self._get_next_line()
                if line == "":
                    raise StopIteration()
                if not line[0] == self._comment_literal:
                    self._in_comment = False
                    return line
        line = self._get_next_line()
        if line == "":
            raise StopIteration()
        return line
    def _get_next_line(self):
        return super(FileWrapper, self).next()
    def _prepare(self):
        pass
class MmapWrapper(file):
    def __init__(self, fd, comment_literal="#"):
        # Map the whole file into memory; reading then happens via the mmap.
        self._mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
        self._comment_literal = comment_literal
        self._in_comment = True
    def __iter__(self):
        return self
    def next(self):
        if self._in_comment:
            # Skip comment lines until the first data line is found.
            while True:
                line = self._mm.readline()
                if line == "":
                    raise StopIteration()
                if not line[0] == self._comment_literal:
                    self._in_comment = False
                    return line
        line = self._mm.readline()
        if line == "":
            raise StopIteration()
        return line
if __name__ == "__main__":
    t0 = time.time()
    for i in range(10):
        with open("1gram-d_1.txt", "r+b") as f:
            df1 = pd.read_table(MmapWrapper(f.fileno()), delimiter="\t", header=None)
    print "mmap:", time.time()-t0

    t0 = time.time()
    for i in range(10):
        df2 = pd.read_table(FileWrapper("#", "1gram-d_1.txt", "r"), delimiter="\t", header=None)
    print "Unbuffered:", time.time()-t0
    print (df1==df2).mean()
gives as output
mmap: 35.3251504898
Unbuffered: 41.3274121284
X.1 1
X.2 1
X.3 1
X.4 1
I also implemented comment-checking only until the first non-comment line is found. This matches your solution and further improves the performance.
For mmaps, there exist some restrictions, though. If file sizes are huge, be sure to have enough RAM. If you're working on a 32-bit OS, you won't be able to read files larger than 2 GB.
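Note also that prot=mmap.PROT_READ as used above is Unix-only; the access keyword is the portable spelling (a sketch, independent of the code above):

import mmap

with open("1gram-d_1.txt", "rb") as f:
    # access=mmap.ACCESS_READ works on both Unix and Windows.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)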