I'm currently trying to read data from .csv files in Python 2.7 with up to 1 million rows and 200 columns (files range from 100 MB to 1.6 GB). I can do this (very slowly) for files with under 300,000 rows, but once I go above that I get memory errors. My code looks like this:
def getdata(filename, criteria):
    data = []
    for criterion in criteria:
        data.append(getstuff(filename, criterion))
    return data

def getstuff(filename, criterion):
    import csv
    data = []
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        for row in datareader:
            if row[3] == "column header":
                data.append(row)
            elif len(data) < 2 and row[3] != criterion:
                pass
            elif row[3] == criterion:
                data.append(row)
            else:
                return data
The reason for the else clause in the getstuff function is that all the elements which fit the criterion will be listed together in the csv file, so I leave the loop when I get past them to save time.
My questions are:
How can I manage to get this to work with the bigger files?
Is there any way I can make it faster?
My computer has 8 GB of RAM, running 64-bit Windows 7, and the processor is 3.40 GHz (not certain what information you need).
Thanks very much for any help!
What worked for me, and is super fast, is the following:
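A minimal sketch along these lines using pandas.read_csv (the file name and column list are placeholders, not the original answer's exact code):

import pandas as pd

# Placeholder file and column names: adjust these to your data.
# Reading only the columns you actually need keeps both time and memory down.
df = pd.read_csv("train.csv", usecols=["col_a", "col_b", "col_c"])
print(df.shape)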
I do a fair amount of vibration analysis and look at large data sets (tens and hundreds of millions of points). My testing showed the pandas.read_csv() function to be 20 times faster than numpy.genfromtxt(), and genfromtxt() is in turn 3 times faster than numpy.loadtxt(). It seems that you need pandas for large data sets.
I posted the code and data sets I used in this testing on a blog discussing MATLAB vs Python for vibration analysis.
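For reference, a minimal timing harness in that spirit might look like the following (vibration_data.csv is a placeholder; the file is assumed to be purely numeric with a single header row, and the exact speed-ups will depend on your data):

import time
import numpy as np
import pandas as pd

filename = "vibration_data.csv"   # placeholder for a large, purely numeric CSV

start = time.time()
df = pd.read_csv(filename)
print("pandas.read_csv:  %.2f s" % (time.time() - start))

start = time.time()
arr = np.genfromtxt(filename, delimiter=",", skip_header=1)
print("numpy.genfromtxt: %.2f s" % (time.time() - start))

start = time.time()
arr = np.loadtxt(filename, delimiter=",", skiprows=1)
print("numpy.loadtxt:    %.2f s" % (time.time() - start))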
Here's another solution, for Python 3.
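A minimal sketch of that approach, assuming (as in the question) that column 3 holds the value being filtered on; the file name and criterion are placeholders:

import csv

filename = "huge.csv"        # placeholder path
criterion = "some value"     # placeholder filter value

with open(filename, "r", newline="") as csvfile:
    datareader = csv.reader(csvfile)
    count = 0
    for row in datareader:
        if row[3] == "column header" or row[3] == criterion:
            print(row)       # process the row here instead of storing it
            count += 1
        elif count > 1:
            # the matching rows are contiguous, so stop once we are past them
            break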
Here datareader is lazy: csv.reader returns an iterator, so only one row is held in memory at a time rather than the whole file.

You are reading all rows into a list, then processing that list. Don't do that.
Process your rows as you produce them. If you need to filter the data first, use a generator function:
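A sketch of such a generator for the question's layout (column 3 holds the value being filtered on, and the matching rows sit together in one block):

import csv

def getstuff(filename, criterion):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)   # yield the header row first
        count = 0
        for row in datareader:
            if row[3] == criterion:
                yield row
                count += 1
            elif count:
                # we have already passed the block of matching rows, so stop reading
                return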
I also simplified your filter test; the logic is the same but more concise.
Because you are only matching a single contiguous block of rows, you could also use itertools:
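For instance, a sketch that skips ahead to the first matching row with dropwhile() and then yields rows with takewhile() until the block ends:

import csv
from itertools import dropwhile, takewhile

def getstuff(filename, criterion):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)   # yield the header row first
        # skip rows until the first match, then keep yielding while rows still match
        matching = takewhile(lambda r: r[3] == criterion,
                             dropwhile(lambda r: r[3] != criterion, datareader))
        for row in matching:
            yield row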
You can now loop over getstuff() directly. Do the same in getdata():
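For example, getdata() can itself become a generator that delegates to getstuff():

def getdata(filename, criteria):
    for criterion in criteria:
        for row in getstuff(filename, criterion):
            yield row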
Now loop directly over getdata() in your code:
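For instance (the file name and criteria below are placeholders for your own values):

somefilename = "hugefile.csv"                  # placeholder path
sequence_of_criteria = ["value1", "value2"]    # placeholder criteria

for row in getdata(somefilename, sequence_of_criteria):
    # process the row here; only this one row is held in memory at a time
    print(row)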
You now only hold one row in memory, instead of your thousands of lines per criterion.
yield makes a function a generator function, which means it won't do any work until you start looping over it.

A generator is a good solution. You can also add a while True: around the code (before opening the CSV), so the generator can be iterated over indefinitely, re-reading the file each time it reaches the end.
For example, on the MNIST dataset stored as a CSV:
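A sketch of that idea (mnist_train.csv and the function name are hypothetical; the while True: restarts the generator from the top of the file whenever it runs out of rows, which is handy for training loops that need endless batches):

import csv

def mnist_rows_forever(filename):
    # Each pass re-opens the file and yields its rows one by one;
    # the while True: means the generator never runs out of rows.
    while True:
        with open(filename, "r") as csvfile:
            datareader = csv.reader(csvfile)
            for row in datareader:
                yield row

rows = mnist_rows_forever("mnist_train.csv")   # hypothetical file name
first = next(rows)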
I was recently trying to solve the same problem, but found the Python pandas package to be reasonably efficient.
You may want to check it out here: http://pandas.pydata.org/
Pandas is a high-performance data analysis library built for large data sets.
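A sketch of how pandas can handle a file that doesn't fit comfortably in memory, reading it in chunks and keeping only the rows you need (the file name, chunk size, column name and value are placeholders):

import pandas as pd

# Placeholders: adjust the file name, chunk size, column and value to your data.
matching_chunks = []
for chunk in pd.read_csv("hugefile.csv", chunksize=100000):
    matching_chunks.append(chunk[chunk["some_column"] == "some value"])

result = pd.concat(matching_chunks)
print(len(result))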