How do I compute the variance of a column of a sparse matrix?

Posted 2020-08-15 01:08

Question:

I have a large scipy.sparse.csc_matrix and would like to normalize it, that is, subtract the column mean from each element and divide by the column standard deviation (std).

scipy.sparse.csc_matrix has a .mean() method, but is there an efficient way to compute the variance or std?

Answer 1:

You can calculate the variance yourself using the mean, with the following formula:

E[X^2] - (E[X])^2

E[X] stands for the mean. So to calculate E[X^2] you have to square the matrix element-wise (for a sparse matrix, use .multiply() or .power(2) rather than **, which is matrix power) and then take the column means. To get (E[X])^2 you simply square the column means of the original matrix. Note that this gives the population variance (ddof=0).
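A minimal runnable sketch of this formula, using a small example matrix (the variable names are illustrative, not from the question):

```python
import numpy as np
from scipy.sparse import csc_matrix

# Small example matrix; the dense copy is only used to verify the result
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, 0.0],
                  [4.0, 0.0, 5.0]])
X = csc_matrix(dense)

# E[X^2]: element-wise square via .multiply(), then the column means
mean_of_square = np.asarray(X.multiply(X).mean(axis=0)).ravel()

# (E[X])^2: square of the column means of the original matrix
square_of_mean = np.asarray(X.mean(axis=0)).ravel() ** 2

# Population variance (ddof=0) of each column
col_var = mean_of_square - square_of_mean
```

Both intermediate quantities stay sparse-friendly: .multiply() preserves sparsity, and only the (1 x n_cols) mean vectors are densified.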



Answer 2:

Sicco has the better answer.

However, another way is to convert the sparse matrix to a dense NumPy array one column at a time (keeping the memory requirements lower than converting the whole matrix at once):

import numpy as np

# mat is the sparse matrix
cols = mat.shape[1]          # number of columns
variances = np.empty(cols)
for i in range(cols):
    # Densify one column at a time and take its variance
    variances[i] = np.var(mat[:, i].toarray())


Answer 3:

The efficient way is actually to densify the entire matrix, then standardize it in the usual way with

X = X.toarray()        # densify
X -= X.mean(axis=0)    # subtract each column's mean
X /= X.std(axis=0)     # divide by each column's std

As @Sebastian noted in the comments, standardizing destroys the sparsity structure (the subtraction step introduces many non-zero elements), so there is no point keeping the matrix in a sparse format.
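A self-contained sketch of this densify-and-standardize approach on a small example matrix (the example data is illustrative; axis=0 applies the operation per column, as the question asks):

```python
import numpy as np
from scipy.sparse import csc_matrix

X = csc_matrix(np.array([[1.0, 0.0, 2.0],
                         [0.0, 3.0, 0.0],
                         [4.0, 0.0, 5.0]]))

Xd = X.toarray()         # densify the whole matrix once
Xd -= Xd.mean(axis=0)    # subtract each column's mean
Xd /= Xd.std(axis=0)     # divide by each column's std
```

After this, every column of Xd has mean 0 and standard deviation 1. (Watch out for constant columns, whose std is 0 and would cause a division by zero.)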