compute the mean and the covariance of a large matrix

Published 2019-06-07 02:40

Question:

I am using Numpy and trying to compute the mean and the covariance of a large matrix (300000 x 70000). I have 32 GB of memory available. What's the best practice for this task in terms of computational efficiency and ease of implementation?

My current implementation is as follows:

import numpy as np

def compute_mean_variance(mat, chunk_size):
    row_count = mat.row_count
    col_count = mat.col_count
    # maintain the `x_sum`, `x2_sum` array
    # mean(x) = x_sum / row_count
    # var(x) = x2_sum / row_count - mean(x)**2
    x_sum = np.zeros([1, col_count])
    x2_sum = np.zeros([1, col_count])

    for i in range(0, row_count, chunk_size):
        sub_mat = mat[i:i+chunk_size, :]
        # in-memory sub_mat of size chunk_size x num_cols
        sub_mat = sub_mat.read().val
        x_sum += np.sum(sub_mat, 0)
        x2_sum += np.sum(sub_mat**2, 0)
    x_mean = x_sum / row_count
    x_var = x2_sum / row_count - x_mean ** 2
    return x_mean, x_var

Any suggestions for improvements?

I find the following implementation more understandable. It also lets numpy calculate the mean and standard deviation for each chunk of columns, so it should be more efficient and numerically stable.

def compute_mean_std(mat, chunk_size):
    row_count = mat.row_count
    col_count = mat.col_count
    mean = np.zeros(col_count)
    std = np.zeros(col_count)

    for i in range(0, col_count, chunk_size):
        sub_mat = mat[:, i : i + chunk_size]
        # num_samples x chunk_size
        sub_mat = sub_mat.read().val
        mean[i : i + chunk_size] = np.mean(sub_mat, axis=0)
        std[i : i + chunk_size] = np.std(sub_mat, axis=0)

    return mean, std

Answer 1:

I assume that in order to calculate the variance, you are using what Wikipedia calls the naïve algorithm. However, the same article warns:

Because x2_sum / row_count and x_mean ** 2 can be very similar numbers, cancellation can lead to the precision of the result being much less than the inherent precision of the floating-point arithmetic used to perform the computation. Thus this algorithm should not be used in practice. This is particularly bad if the standard deviation is small relative to the mean.

As an alternative, you might use the two-pass algorithm: first calculate the mean, then use it in the calculation of the variance. In principle this seems wasteful, since one has to iterate over the data twice. However, the "mean" used in the calculation of the variance does not need to be the true mean; a reasonable estimate (perhaps calculated only from the first chunk) suffices. This then reduces to the method of assumed mean.
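A minimal sketch of that assumed-mean idea, using plain in-memory ndarray slicing in place of the question's `mat.read().val` chunk API (the function name and the choice of shift are illustrative):

```python
import numpy as np

def assumed_mean_var(mat, chunk_size):
    """Streaming column-wise mean/variance with an assumed mean.

    Shifting every chunk by K (estimated from the first chunk) keeps the
    running sums small, avoiding most of the cancellation that plagues
    the naive sum(x**2)/n - mean**2 formula.
    """
    row_count = mat.shape[0]
    K = mat[:chunk_size].mean(axis=0)       # assumed mean from the first chunk
    shifted_sum = np.zeros(mat.shape[1])
    shifted_sq_sum = np.zeros(mat.shape[1])

    for i in range(0, row_count, chunk_size):
        chunk = mat[i:i + chunk_size] - K   # values are small after the shift
        shifted_sum += chunk.sum(axis=0)
        shifted_sq_sum += (chunk ** 2).sum(axis=0)

    mean = K + shifted_sum / row_count
    # Var(x) = E[(x-K)^2] - (E[x-K])^2, unaffected by the constant shift K
    var = shifted_sq_sum / row_count - (shifted_sum / row_count) ** 2
    return mean, var
```

Even a rough K removes the bulk of the magnitude from the accumulated sums, which is where the naive formula loses precision.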

Another possibility is to delegate the calculation of the mean/variance of each chunk directly to numpy and then combine the per-chunk results into the overall mean/variance using the parallel algorithm.
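A sketch of that combining step, following the merge formula of the parallel algorithm described on the same Wikipedia page (function names are illustrative, and a plain ndarray stands in for the question's chunked matrix):

```python
import numpy as np

def combine_stats(n_a, mean_a, M2_a, n_b, mean_b, M2_b):
    """Merge the statistics of two disjoint chunks.

    M2 is the sum of squared deviations from that chunk's own mean,
    i.e. n * var for the chunk.
    """
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * (n_b / n)
    M2 = M2_a + M2_b + delta ** 2 * (n_a * n_b / n)
    return n, mean, M2

def mean_var_parallel(mat, chunk_size):
    n, mean, M2 = 0, 0.0, 0.0
    for i in range(0, mat.shape[0], chunk_size):
        chunk = mat[i:i + chunk_size]
        n_b = chunk.shape[0]
        mean_b = chunk.mean(axis=0)         # numpy does the per-chunk work
        M2_b = chunk.var(axis=0) * n_b
        n, mean, M2 = combine_stats(n, mean, M2, n_b, mean_b, M2_b)
    return mean, M2 / n
```

Because each chunk's statistics are computed around that chunk's own mean, the merge stays numerically stable, and the per-chunk work vectorizes in numpy.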



Answer 2:

I had a project at university that involved time-complexity tests for different matrix-multiplication algorithms.

I've uploaded the source code here.

One of the optimizations I found was that you can speed up array access by structuring your for-loops to traverse a row at a time rather than a column at a time. This is due to the way caches exploit spatial locality: the CPU fetches array elements that sit side by side in memory, and in a row-major 2D array those are the elements of a single row.
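As an illustration of the loop-order point (a pure-Python sketch for clarity; real numpy code would just call `A @ B`): with row-major storage, the i-k-j ordering keeps the inner loop moving along rows of `B` and `C` instead of striding down a column.

```python
import numpy as np

def matmul_ikj(A, B):
    """Naive matrix multiply with a cache-friendly loop order.

    In the i-k-j order the innermost loop reads B[k, :] and writes
    C[i, :] at consecutive memory addresses (good spatial locality),
    whereas the textbook i-j-k order walks down a column of B,
    jumping a full row-length in memory on every step.
    """
    n, m = A.shape
    m2, p = B.shape
    assert m == m2
    C = np.zeros((n, p))
    for i in range(n):
        for k in range(m):
            a = A[i, k]
            for j in range(p):   # contiguous accesses to B[k, :] and C[i, :]
                C[i, j] += a * B[k, j]
    return C
```

The effect is invisible at Python speed, but in compiled languages (or inside numpy's own kernels) this reordering alone can change the running time by a large constant factor.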

Also, if these are 'sparse' matrices (mostly zero elements), you can change the data structure to record only the non-zero elements.
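A toy sketch of the sparse idea, storing (row, col, value) triples for the non-zero entries only (libraries such as scipy.sparse implement this properly with CSR/CSC formats; the helper name here is illustrative):

```python
import numpy as np

def to_coo(dense):
    """Convert a dense matrix to a coordinate list of non-zero entries."""
    rows, cols = np.nonzero(dense)
    return list(zip(rows.tolist(), cols.tolist(), dense[rows, cols].tolist()))

dense = np.array([[0.0, 0.0, 3.0],
                  [4.0, 0.0, 0.0]])
print(to_coo(dense))  # [(0, 2, 3.0), (1, 0, 4.0)]
```

Two stored triples instead of six stored floats; the savings grow with the fraction of zeros.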

Obviously, if you're given an ordinary dense matrix, the cost of converting it to a sparse representation is probably not worth it, but I thought these observations were worth sharing :)