I have a large scipy.sparse.csc_matrix and would like to normalize it. That is, subtract the column mean from each element and divide by the column standard deviation (std). scipy.sparse.csc_matrix has a .mean() method, but is there an efficient way to compute the variance or std?
You can calculate the variance yourself using the mean, with the following formula:

E[X^2] - (E[X])^2

where E[X] stands for the mean. So to calculate E[X^2] you would square the csc_matrix element-wise and then take the mean of the result. To get (E[X])^2 you simply square the result of calling mean on the original matrix.
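A minimal sketch of that approach, assuming your matrix is called mat (the random matrix below is just a stand-in for real data):

import numpy as np
import scipy.sparse as sp

# Stand-in data; replace with your own csc_matrix
mat = sp.random(1000, 50, density=0.01, format='csc')

mean = np.asarray(mat.mean(axis=0)).ravel()                  # E[X] per column
mean_of_sq = np.asarray(mat.power(2).mean(axis=0)).ravel()   # E[X^2] per column
var = mean_of_sq - mean ** 2                                 # E[X^2] - (E[X])^2
std = np.sqrt(var)

Note that mat.power(2) squares the entries element-wise while keeping the matrix sparse, so nothing needs to be densified.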
Sicco has the better answer.
However, another way is to convert the sparse matrix to a dense numpy array one column at a time, which keeps the memory requirements lower than converting the whole matrix at once (and slicing a single column out of a csc_matrix is cheap, since each column's data is stored contiguously):
import numpy as np

# mat is the sparse matrix
# Get the number of columns
cols = mat.shape[1]
# Per-column variances
arr = np.empty(shape=cols)
for i in range(cols):
    arr[i] = np.var(mat[:, i].toarray())
The efficient way is actually to densify the entire matrix, then standardize it in the usual way with
X = X.toarray()
X -= X.mean(axis=0)   # subtract each column's mean
X /= X.std(axis=0)    # divide by each column's std
As @Sebastian has noted in his comments, standardizing destroys the sparsity structure (introduces lots of non-zero elements) in the subtraction step, so there's no use keeping the matrix in a sparse format.
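To illustrate that point, here is a small made-up example showing how the centering step fills in the zeros:

import numpy as np
import scipy.sparse as sp

# Made-up matrix with roughly 30% non-zero entries
X = sp.random(5, 4, density=0.3, format='csc').toarray()
print(np.count_nonzero(X))   # only a handful of non-zeros before centering

X -= X.mean(axis=0)          # subtracting the column means...
print(np.count_nonzero(X))   # ...turns almost every entry non-zero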