I grabbed the KDD Track 1 dataset from Kaggle and decided to load a ~2.5GB, 3-column CSV file into memory on my 16GB high-memory EC2 instance:
import numpy as np
data = np.loadtxt('rec_log_train.txt')
The Python session ate up all my memory (100%) and then got killed.
I then read the same file using R (via read.table) and it used less than 5GB of RAM, which collapsed to less than 2GB after I called the garbage collector.
My question is: why did this fail under numpy, and what's the proper way of reading a file into memory? Yes, I could use generators and avoid the problem, but that's not the goal.
import pandas, re, numpy as np

def load_file(filename, num_cols, delimiter='\t'):
    data = None
    try:
        # Fast path: load a previously serialized binary copy if it exists.
        data = np.load(filename + '.npy')
    except IOError:
        splitter = re.compile(delimiter)

        # Yield one field at a time so no intermediate list is ever built.
        def items(infile):
            for line in infile:
                for item in splitter.split(line):
                    yield item

        with open(filename, 'r') as infile:
            data = np.fromiter(items(infile), np.float64, -1)
        data = data.reshape((-1, num_cols))
        # Serialize to .npy so subsequent loads are fast.
        np.save(filename, data)
    return pandas.DataFrame(data)
This reads in the 2.5GB file and serializes the output matrix. The input file is read lazily, so no intermediate data structures are built and memory use stays minimal. The initial load takes a long time, but each subsequent load (of the serialized .npy file) is fast. Please let me know if you have tips!
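For reference, a minimal sketch of how I call it; the 3-column count and tab delimiter are assumptions based on the rec_log_train.txt file described in the question, and load_file is the function defined above.

    # Example usage of load_file; filename and column count assumed from the question.
    df = load_file('rec_log_train.txt', num_cols=3, delimiter='\t')
    print(df.shape)   # (num_rows, 3)
    print(df.head())  # peek at the first few rows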
Try out recfile for now: http://code.google.com/p/recfile/ . There are a couple of efforts I know of to make a fast C/C++ file reader for NumPy; it's on my short to-do list for pandas because it causes problems like these. Warren Weckesser also has a project here: https://github.com/WarrenWeckesser/textreader . I don't know which one is better; try them both?
You can try numpy.fromfile:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html
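For example, something along these lines should work for a whitespace/tab-delimited numeric file (a sketch, assuming the file contains only the three numeric columns and no header):

    import numpy as np

    # A non-empty `sep` switches np.fromfile from binary to text parsing;
    # a single space matches any run of whitespace (tabs, newlines included).
    data = np.fromfile('rec_log_train.txt', dtype=np.float64, sep=' ')
    data = data.reshape((-1, 3))  # 3 columns, as in the question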