I am creating a co-occurrence matrix, which is of size 1M by 1M with integer counts.
After the matrix is created, the only operation I will do on it is to get the top N values per row (or column, as it is a symmetric matrix).
I have to create the matrix as sparse to be able to fit it in memory. I read input data from a big file and update the co-occurrence count of a pair of indexes (row, col) incrementally.
The sample code for the sparse dok_matrix specifies that I should declare the size of the matrix beforehand. I know the upper bound for my matrix (1M by 1M), but in reality it might be smaller than that.
Do I have to specify the size beforehand, or can I just create it incrementally?
import numpy as np
from scipy.sparse import dok_matrix

S = dok_matrix((5, 5), dtype=np.float32)
for i in range(5):
    for j in range(5):
        S[i, j] = i + j  # Update element
A SO question from a couple of days ago, creating sparse matrix of unknown size, talks about creating a sparse matrix from data read from a file. There the OP wanted to use the lil format; I recommended building the input arrays for a coo format.

In other SO questions I've found that adding values to a plain dictionary is faster than adding them to a dok matrix - even though a dok is a dictionary subclass. There's quite a bit of overhead in the dok indexing method. In some cases I suggested building a dict with a tuple key, and using update to add the values to a defined dok. But I suspect in your case the coo route is better (sketched below).

dok and lil are the best formats for incremental construction, but neither is that great compared to plain Python list and dict methods.
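Here's a minimal sketch of that dict-then-coo route; the hard-coded pairs stand in for whatever you read from your file, and the 1M x 1M shape is your stated upper bound:

from collections import defaultdict
import numpy as np
from scipy import sparse

# Accumulate co-occurrence counts in a plain dict keyed by (row, col) tuples.
counts = defaultdict(int)
for row, col in [(0, 1), (0, 1), (2, 3)]:   # stand-in for pairs read from the file
    counts[row, col] += 1

# Turn the dict into the coo input arrays.
keys = np.array(list(counts.keys()))                      # shape (nnz, 2)
vals = np.fromiter(counts.values(), dtype=np.int32, count=len(counts))
M = sparse.coo_matrix((vals, (keys[:, 0], keys[:, 1])), shape=(10**6, 10**6))

A side benefit of coo: if you skip the dict and just collect duplicate (row, col) entries, the duplicates are summed when you convert with M.tocsr().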
As for the top N values of each row, I recall exploring that, but some time back, so I can't offhand pull up a good SO question. You probably want a row-oriented format such as lil or csr; one possible approach is sketched below.
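A sketch of one way to do it using the csr attributes directly (top_n_per_row is my own helper name, not a scipy function, and it only considers stored, i.e. nonzero, values):

import numpy as np
from scipy import sparse

def top_n_per_row(mat, n):
    """Return {row: [(col, value), ...]} with the n largest stored values per row."""
    csr = mat.tocsr()
    out = {}
    for i in range(csr.shape[0]):
        start, stop = csr.indptr[i], csr.indptr[i + 1]   # slice bounds for row i
        data = csr.data[start:stop]
        cols = csr.indices[start:stop]
        if data.size > n:
            keep = np.argpartition(data, -n)[-n:]        # indices of the n largest
            data, cols = data[keep], cols[keep]
        order = np.argsort(data)[::-1]                   # sort descending by value
        out[i] = list(zip(cols[order], data[order]))
    return out

The Python-level loop over 1M rows isn't fast, but it only ever touches the stored values; and since your matrix is symmetric, rows and columns are interchangeable.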
As for the question - 'do you need to specify the size on creation' - yes. But since a sparse matrix, regardless of format, stores only the nonzero values, there's little harm in creating a matrix that is too large.
I can't think of anything in a dok or coo format that hinges on the shape - at least not in terms of data storage or creation. lil and csr will have some extra values. If you really need to explore this, read up on how the values are stored, and play with small matrices.
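To make that concrete (my own illustration, not from the original answer): an empty csr already carries an indptr array with one entry per row plus one, while for dok the declared shape is just metadata:

import numpy as np
from scipy import sparse

c = sparse.csr_matrix((10**6, 10**6), dtype=np.int32)
print(c.indptr.shape)    # (1000001,) - one offset per row plus one, even with nothing stored
d = sparse.dok_matrix((10**6, 10**6), dtype=np.int32)
print(len(d))            # 0 - no keys; the oversized shape costs essentially nothing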
==================
It looks like all the code for the dok format is Python, in

/usr/lib/python3/dist-packages/scipy/sparse/dok.py

Scanning that file, I see that dok does have a resize method:
d.resize?
Signature: d.resize(shape)
Docstring:
Resize the matrix in-place to dimensions given by 'shape'.
Any non-zero elements that lie outside the new shape are removed.
File: /usr/lib/python3/dist-packages/scipy/sparse/dok.py
Type: method
So if you want to initialize the matrix as 1M x 1M and later resize it to 100 x 100, you can do so - resize will step through all the keys to make sure there aren't any outside the new range. So it isn't cheap, even though the main action is just to change the shape parameter:
newM, newN = shape
M, N = self.shape
if newM < M or newN < N:
    # Remove all elements outside new dimensions
    for (i, j) in list(self.keys()):
        if i >= newM or j >= newN:
            del self[i, j]
self._shape = shape
If you know for sure that there aren't any outside keys, you could change the shape directly. The other sparse formats don't have a resize method.
In [31]: d=sparse.dok_matrix((10,10),int)
In [32]: d
Out[32]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
In [33]: d.resize((5,5))
In [34]: d
Out[34]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
In [35]: d._shape=(9,9)
In [36]: d
Out[36]:
<9x9 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
See also:
Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?
Get top-n items of every row in a scipy sparse matrix