I'm new to python, coming from matlab. I have a large sparse matrix saved in matlab v7.3 (HDF5) format. I've so far found two ways of loading in the file, using h5py and tables. However, operating on the matrix seems to be extremely slow after either. For example, in matlab:
>> whos
  Name          Size               Bytes  Class     Attributes
  M         11337x133338        77124408  double    sparse
>> tic, sum(M(:)); toc
Elapsed time is 0.086233 seconds.
Using tables:
t = time.time()
sum(f.root.M.data)
elapsed = time.time() - t
print elapsed
35.929461956
Using h5py:
t = time.time()
sum(f["M"]["data"])
elapsed = time.time() - t
print elapsed
(I gave up waiting ...)
[EDIT]
Based on the comments from @bpgergo, I should add that I've tried converting the result loaded in by h5py (f) into a numpy array or a scipy sparse array in the following two ways:
from scipy import sparse
A = sparse.csc_matrix((f["M"]["data"], f["M"]["ir"], f["M"]["jc"]))
or
data = numpy.asarray(f["M"]["data"])
ir = numpy.asarray(f["M"]["ir"])
jc = numpy.asarray(f["M"]["jc"])
A = sparse.csc_matrix((data, ir, jc))  # jc holds CSC column pointers, not per-entry column indices
but both of these operations are extremely slow as well.
Is there something I'm missing here?
The final answer for posterity:
Most of your problem is that you're using python sum on what's effectively a memory-mapped array (i.e. it's on disk, not in memory).
First off, you're comparing the time it takes to read things from disk to the time it takes to read things in memory. Load the array into memory first, if you want to compare to what you're doing in matlab.
Secondly, python's builtin sum is very inefficient for numpy arrays. (Or, rather, iterating through every item of a numpy array independently is very slow, which is what python's builtin sum is doing.) Use numpy.sum(yourarray) or yourarray.sum() instead for numpy arrays.
As an example:
(Using h5py, because I'm more familiar with it.)
You're missing numpy: http://www.scipy.org/NumPy_for_Matlab_Users
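For the sparse-matrix conversion attempted in the question, here is a rough sketch assuming the file stores the matrix in compressed sparse column form, with data as the nonzero values, ir as zero-based row indices, and jc as column pointers. The function name csc_from_matlab and the tiny 3x3 example are made up for illustration; the row count has to be supplied separately because it cannot be recovered reliably from the three arrays alone:

```python
import numpy as np
from scipy import sparse


def csc_from_matlab(data, ir, jc, n_rows):
    """Build a scipy CSC matrix from in-memory data / ir / jc arrays."""
    data = np.asarray(data)
    # Cast the index arrays to a signed integer type that scipy accepts.
    ir = np.asarray(ir).astype(np.int64)
    jc = np.asarray(jc).astype(np.int64)
    n_cols = len(jc) - 1  # jc has one pointer per column plus a final end marker
    return sparse.csc_matrix((data, ir, jc), shape=(n_rows, n_cols))


# Tiny stand-in for the real arrays, encoding
# [[1, 0, 2],
#  [0, 0, 3],
#  [4, 0, 0]]
data = np.array([1.0, 4.0, 2.0, 3.0])
ir = np.array([0, 2, 0, 1], dtype=np.uint64)
jc = np.array([0, 2, 2, 4], dtype=np.uint64)

A = csc_from_matlab(data, ir, jc, n_rows=3)
print(A.sum())  # sums only the stored nonzeros
```

With the real file, the three arrays would first be read fully into memory (for example f["M"]["data"][...] in h5py) before being passed in, so the constructor works on numpy arrays rather than on-disk datasets.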