I am creating a co-occurrence matrix, which is of size 1M by 1M with integer counts.
After the matrix is created, the only operation I will do on it is to get the top N values per row (or column, as it is a symmetric matrix).
I have to create the matrix as sparse to be able to fit it in memory. I read input data from a big file and update the co-occurrence count of a pair of indexes (row, col) incrementally.
The sample code for the sparse dok_matrix specifies that I should declare the size of the matrix beforehand. I know the upper bound for my matrix (1M by 1M), but in reality it might be smaller than that.
Do I have to specify the size beforehand, or can I just create it incrementally?
import numpy as np
from scipy.sparse import dok_matrix

S = dok_matrix((5, 5), dtype=np.float32)
for i in range(5):
    for j in range(5):
        S[i, j] = i + j  # Update element
A SO question from a couple of days ago, creating sparse matrix of unknown size, talks about creating a sparse matrix from data read from a file. There the OP wanted to use the lil format; I recommended building the input arrays for a coo format.

In other SO questions I've found that adding values to a plain dictionary is faster than adding them to a dok matrix - even though a dok is a dictionary subclass. There's quite a bit of overhead in the dok indexing method. In some cases I suggested building a dict with a tuple key, and using update to add the values to a defined dok. But I suspect in your case the coo route is better (sketched below).

dok and lil are the best formats for incremental construction, but neither is that great compared to plain Python list and dict methods.
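Here's a minimal sketch of that dict-then-coo route; the hard-coded pairs stand in for whatever you read from your file, and the 1M x 1M shape is your stated upper bound:

from collections import defaultdict
import numpy as np
from scipy import sparse

# Accumulate co-occurrence counts in a plain dict keyed by (row, col) tuples.
counts = defaultdict(int)
for row, col in [(0, 1), (0, 1), (2, 3)]:   # stand-in for pairs read from the file
    counts[row, col] += 1

# Turn the dict into the coo input arrays.
keys = np.array(list(counts.keys()))                      # shape (nnz, 2)
vals = np.fromiter(counts.values(), dtype=np.int32, count=len(counts))
M = sparse.coo_matrix((vals, (keys[:, 0], keys[:, 1])), shape=(10**6, 10**6))

A side benefit of coo: if you skip the dict and just collect duplicate (row, col) entries, the duplicates are summed when you convert with M.tocsr().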
As for the top N values of each row, I recall exploring that, but some time back, so I can't offhand pull up a good SO question. You probably want a row-oriented format such as lil or csr; one possible approach is sketched below.
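A sketch of one way to do it using the csr attributes directly (top_n_per_row is my own helper name, not a scipy function, and it only considers stored, i.e. nonzero, values):

import numpy as np
from scipy import sparse

def top_n_per_row(mat, n):
    """Return {row: [(col, value), ...]} with the n largest stored values per row."""
    csr = mat.tocsr()
    out = {}
    for i in range(csr.shape[0]):
        start, stop = csr.indptr[i], csr.indptr[i + 1]   # slice bounds for row i
        data = csr.data[start:stop]
        cols = csr.indices[start:stop]
        if data.size > n:
            keep = np.argpartition(data, -n)[-n:]        # indices of the n largest
            data, cols = data[keep], cols[keep]
        order = np.argsort(data)[::-1]                   # sort descending by value
        out[i] = list(zip(cols[order], data[order]))
    return out

The Python-level loop over 1M rows isn't fast, but it only ever touches the stored values; and since your matrix is symmetric, rows and columns are interchangeable.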
As for the question - 'do you need to specify the size on creation' - yes. But since a sparse matrix, regardless of format, stores only the nonzero values, there's little harm in creating a matrix that is too large.
I can't think of anything in a dok or coo format that hinges on the shape - at least not in terms of data storage or creation. lil and csr will have some extra values. If you really need to explore this, read up on how the values are stored, and play with small matrices.
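To make that concrete (my own illustration, not from the original answer): an empty csr already carries an indptr array with one entry per row plus one, while for dok the declared shape is just metadata:

import numpy as np
from scipy import sparse

c = sparse.csr_matrix((10**6, 10**6), dtype=np.int32)
print(c.indptr.shape)    # (1000001,) - one offset per row plus one, even with nothing stored
d = sparse.dok_matrix((10**6, 10**6), dtype=np.int32)
print(len(d))            # 0 - no keys; the oversized shape costs essentially nothing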
==================
It looks like all the code for the dok format is Python, in

/usr/lib/python3/dist-packages/scipy/sparse/dok.py

Scanning that file, I see that dok does have a resize method:
d.resize?
Signature: d.resize(shape)
Docstring:
Resize the matrix in-place to dimensions given by 'shape'.
Any non-zero elements that lie outside the new shape are removed.
File: /usr/lib/python3/dist-packages/scipy/sparse/dok.py
Type: method
So if you want to initialize the matrix as 1M x 1M and later resize it to 100 x 100, you can do so - resize will step through all the keys to make sure there aren't any outside the new range. So it isn't cheap, even though the main action is just to change the shape parameter:
newM, newN = shape
M, N = self.shape
if newM < M or newN < N:
    # Remove all elements outside new dimensions
    for (i, j) in list(self.keys()):
        if i >= newM or j >= newN:
            del self[i, j]
self._shape = shape
If you know for sure that there aren't any outside keys, you could change the shape directly. The other sparse formats don't have a resize method.
In [31]: d=sparse.dok_matrix((10,10),int)
In [32]: d
Out[32]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
In [33]: d.resize((5,5))
In [34]: d
Out[34]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
In [35]: d._shape=(9,9)
In [36]: d
Out[36]:
<9x9 sparse matrix of type '<class 'numpy.float64'>'
with 0 stored elements in Dictionary Of Keys format>
See also:
Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?
Get top-n items of every row in a scipy sparse matrix