I have a large array of numbers written in a CSV file and need to load only a slice of that array. Conceptually I want to call np.genfromtxt() and then row-slice the resulting array, but

- the file is so large that it may not fit in RAM;
- the number of relevant rows might be small, so there is no need to parse every line.

MATLAB has the function textscan() that can take a file descriptor and read only a chunk of the file. Is there anything like that in NumPy?
For now, I defined the following function that reads only the lines that satisfy the given condition:
    import numpy as np

    def genfromtxt_cond(fname, cond=(lambda line: True)):
        # collect only the lines that satisfy the given condition
        res = []
        with open(fname) as file:
            for line in file:
                if cond(line):
                    res.append([float(s) for s in line.split()])
        return np.array(res, dtype=np.float64)
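For example, a call might look like this (the file name and condition are made up for illustration):

    # hypothetical usage: skip comment lines and load everything else
    data = genfromtxt_cond('huge_file.csv', cond=lambda line: not line.startswith('#'))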
There are several problems with this solution:
- not general: it supports only the float type, while genfromtxt detects the types, which may vary from column to column; it also handles missing values, converters, skipping, etc.;
- not efficient: when the condition is complicated, every line may be parsed twice; also, the data structure used and the read buffering may be suboptimal;
- requires writing code.
Is there a standard function that implements filtering, or some counterpart of MATLAB's textscan?
If you pass a list of types (the format condition), use a try block, and use yield to produce the rows one at a time, we should be able to replicate textscan().

Edit: I forgot the except block. It runs okay now and you can use genfromtext as a generator like so (using a random CSV log I have sitting around):
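Something along these lines (the formatSpec default and the skip-on-error behaviour are just one way to do it):

    def genfromtext(fname, formatSpec=(float, float, str)):
        """Yield one tuple per CSV line, converted according to formatSpec.
        Lines that fail to convert (e.g. headers) are skipped."""
        with open(fname) as f:
            for line in f:
                fields = line.rstrip('\n').split(',')
                try:
                    # zip pairs each field with its converter and stops at the
                    # shorter sequence, so no len(...) bookkeeping is needed
                    yield tuple(conv(tok) for tok, conv in zip(fields, formatSpec))
                except ValueError:
                    continue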
I should probably note that I'm using zip to zip together the comma-split line and the formatSpec, which tuplifies the two lists (stopping when one of the lists runs out of items) so we can iterate over them together, avoiding a loop dependent on len(line) or something like that.

Trying to demonstrate the comment to the OP:
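For instance (the file name and formatSpec here are placeholders):

    formatSpec = (float, float, float)
    # consume the generator: collect the parsed rows and show the first few
    rows = list(genfromtext('random_log.csv', formatSpec))
    for row in rows[:5]:
        print(row)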
Output
I can think of two approaches that provide some of the functionality you are asking for:
1. To read a file either in chunks or in strides of n lines, etc.: you can pass a generator to numpy.genfromtxt as well as to numpy.loadtxt. This way you can load a large dataset from a text file memory-efficiently while retaining all the convenient parsing features of the two functions.

2. To read data only from lines that match a criterion that can be expressed as a regex: you can use numpy.fromregex and a regular expression to precisely define which tokens from a given line of the input file should be loaded. Lines not matching the pattern are ignored.

To illustrate the two approaches, I'm going to use an example from my research context.
I often need to load files with the following structure: blocks of numerical data separated by a couple of non-numeric lines. These files can be huge (GBs) and I'm only interested in the numerical data. All data blocks have the same size -- 6 rows in this example -- and they are always separated by two lines, so the stride of the blocks is 8.

Using the first approach:
First I'm going to define a generator that filters out the undesired lines:
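A sketch of such a generator, assuming the two non-data lines come at the start of every block of stride lines:

    def filter_lines(f, stride):
        """Yield only the data lines of f, skipping the first two
        (non-numeric) lines of every block of `stride` lines."""
        for i, line in enumerate(f):
            if i % stride > 1:
                yield line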
Then I open the file, create a filter_lines generator (here I need to know the stride), and pass that generator to genfromtxt:
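Roughly like this (the file name is a placeholder, and usecols=(1, 2, 3) assumes the data rows have four columns, the first of which is dropped):

    import numpy as np

    with open('data.txt') as f:
        # pass the filtering generator instead of a file name;
        # usecols keeps only the last three of the four columns
        data = np.genfromtxt(filter_lines(f, 8), usecols=(1, 2, 3))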
This works like a breeze. Note that I'm able to use usecols to get rid of the first column of the data. In the same way, you could use all the other features of genfromtxt -- detecting the types, varying types from column to column, missing values, converters, etc.

In this example data.shape was (204000, 3) while the original file consisted of 272000 lines.

Here the generator is used to filter homogeneously strided lines, but one can likewise imagine it filtering out inhomogeneous blocks of lines based on (simple) criteria.

Using the second approach:
Here's the regexp I'm going to use:
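For instance, if each data line consists of an integer index followed by the three floating-point columns of interest, a pattern along these lines would do (the exact pattern of course depends on the real file format):

    # the parenthesised groups capture the three float columns;
    # the leading integer index is matched but not captured
    regexp = r'\s*\d+\s+([-+.\deE]+)\s+([-+.\deE]+)\s+([-+.\deE]+)\s*\n'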
Groups -- i.e. inside () -- define the tokens to be extracted from a given line. Next, fromregex does the job and ignores lines not matching the pattern. The result is exactly the same as in the first approach.
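A sketch of that call (the file name and field names are placeholders; the final view/reshape simply turns the structured result into a plain (N, 3) float array):

    # fromregex returns a structured array with one field per regex group
    data = np.fromregex('data.txt', regexp,
                        dtype=[('x', float), ('y', float), ('z', float)])
    # convert to a plain 2-D float array, matching the genfromtxt result
    data = data.view(float).reshape(-1, 3)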