What could be an algorithm for computing the Pearson cross-correlation matrix in a distributed environment where my data is divided by id (say: 1-4) and time (say: Jan-Dec) among different nodes?
For example:
Node A({id1, Jan}, {id2, Jan}); Node B({id3, Jan}, {id4, Jan}),
Node C({id1, Feb}, {id2, Feb}); Node A({id1, March}, {id2, March}),
Node C({id3, Feb}, {id4, Feb}); Node B({id3, March}, {id4, March})
Basically, I mean that the Jan data for all ids is not on one node.
I'm wondering what strategy I could use so that I do not have to ship large data from one node to another, since Pearson correlation is a pairwise computation. I'm OK with just transferring small intermediate results between nodes. How should I partition my data based on id and time so that I can efficiently calculate the cross-correlation matrix among multiple ids?
The language of choice is C++
The correlation between two vectors of data is cor(X,Y) = cov(X,Y) / [sd(X) * sd(Y)]. Is there any way to break these up into block computations? The essential computation required (since sd(X) = sqrt(cov(X,X))) is
cov(X,Y) = <X Y> - <X> <Y>
= 1/N (sum[i] X[i] Y[i]) - 1/N (sum[i] X[i]) * 1/N (sum[i] Y[i])
This is a sum over all indices i. Each index i, however, corresponds to a node n with N_n events and a sub-index (within that node) k_n:
cov(X,Y) = 1/N (sum[n] sum[k_n] X[k_n] Y[k_n])
         - 1/N^2 (sum[n] sum[k_n] X[k_n]) * (sum[n] sum[k_n] Y[k_n])
Since N = sum[n] N_n, this can be rewritten as
cov(X,Y) = (sum[n] N_n/N 1/N_n sum[k_n] X[k_n] Y[k_n])
         - (sum[n] N_n/N 1/N_n sum[k_n] X[k_n]) * (sum[n] N_n/N 1/N_n sum[k_n] Y[k_n])
         = (sum[n] N_n/N <XY>_n) - (sum[n] N_n/N <X>_n) * (sum[n] N_n/N <Y>_n)
So, each node need only report its number of entries N_n and the means <X>_n, <Y>_n, and <XY>_n (and, for the purposes of the correlation, <X^2>_n and <Y^2>_n) within the node. The global covariance can then be calculated by summing these means together with the appropriate weights N_n/N (where again N = sum[n] N_n) to get the global means.
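To make this concrete, here is a minimal C++ sketch for a single pair (X, Y); the names (PartialStats, local_stats, merge, correlation) are illustrative, and the transport between nodes (MPI, sockets, etc.) is left out. Equivalently to shipping the weighted means, each node can ship its raw sums, in which case merging reduces to component-wise addition:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-node sufficient statistics for one pair (X, Y).
// Only these six numbers ever leave a node.
struct PartialStats {
    std::size_t n = 0;   // N_n: number of events on this node
    double sx  = 0.0;    // sum of X
    double sy  = 0.0;    // sum of Y
    double sxy = 0.0;    // sum of X*Y
    double sxx = 0.0;    // sum of X^2
    double syy = 0.0;    // sum of Y^2
};

// Computed locally on each node from its own chunk of the data.
PartialStats local_stats(const std::vector<double>& x,
                         const std::vector<double>& y) {
    PartialStats p;
    p.n = x.size();
    for (std::size_t i = 0; i < x.size(); ++i) {
        p.sx  += x[i];
        p.sy  += y[i];
        p.sxy += x[i] * y[i];
        p.sxx += x[i] * x[i];
        p.syy += y[i] * y[i];
    }
    return p;
}

// Merging is component-wise addition, so partials can be combined
// pairwise in any order (e.g. a tree reduction across nodes).
PartialStats merge(PartialStats a, const PartialStats& b) {
    a.n += b.n; a.sx += b.sx; a.sy += b.sy;
    a.sxy += b.sxy; a.sxx += b.sxx; a.syy += b.syy;
    return a;
}

// cor(X,Y) = cov(X,Y) / (sd(X) * sd(Y)), evaluated on the
// globally merged sums.
double correlation(const PartialStats& p) {
    const double n   = static_cast<double>(p.n);
    const double cov = p.sxy / n - (p.sx / n) * (p.sy / n);
    const double vx  = p.sxx / n - (p.sx / n) * (p.sx / n);
    const double vy  = p.syy / n - (p.sy / n) * (p.sy / n);
    return cov / (std::sqrt(vx) * std::sqrt(vy));
}
```

One caveat on the design: the <XY> - <X><Y> form can lose precision when the means are large relative to the spread of the data; shifting each series by an approximate mean before accumulating (or using a Welford-style update) is a standard remedy.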
Edit: LaTeX version
Since these equations are hard to parse without LaTeX, here are more readable LaTeX versions. The covariance of two lists of data X and Y is defined to be

$$\operatorname{cov}(X,Y) = \langle XY\rangle - \langle X\rangle\langle Y\rangle$$

where each quantity $\langle X\rangle$, $\langle Y\rangle$, and $\langle XY\rangle$ is a mean (of the list X, the list Y, and the pairwise product list XY). The computation of the means can be broken down as a weighted sum over the various nodes. Calling any of X, Y, XY, or X^2 or Y^2 (necessary to compute the correlation) Z, the mean of Z is

$$\langle Z\rangle = \frac{1}{N}\sum_k N_k \langle Z\rangle_k, \qquad N = \sum_k N_k$$

where $\langle Z\rangle_k$ is the mean of Z on the k-th node and $N_k$ is the number of data points in the k-th node. This reduces the amount of information needed from each node to $N_k$, $\langle X\rangle_k$, $\langle Y\rangle_k$, $\langle XY\rangle_k$, $\langle X^2\rangle_k$, and $\langle Y^2\rangle_k$.
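As a quick sanity check of the weighted merge (illustrative numbers, not from the original post): if node 1 holds X = (1, 2), so N_1 = 2 and $\langle X\rangle_1 = 1.5$, and node 2 holds X = (3, 4, 5), so N_2 = 3 and $\langle X\rangle_2 = 4$, then the global mean is $(2 \cdot 1.5 + 3 \cdot 4)/5 = 3$, exactly the mean of the combined list (1, 2, 3, 4, 5).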
Take a look at this article, as it could explain it a little further: https://en.wikipedia.org/wiki/Covariance_matrix
Let's take two variables X and Y that you measured, meaning that you can provide two arrays of the same length so that {x_i} are the measured values of X and {y_i} are the measured values of Y.
From a conceptual point of view, the covariance of two variables X and Y expresses how strongly a variation of X tends to correspond to a variation of Y.
To compute the covariance matrix you need three elements:
- <X>, the arithmetic average of X
- <Y>, the arithmetic average of Y
- <XY>, the arithmetic average of the element-wise product of the arrays X and Y
Provided that cov(X,Y) = cov(Y,X) and that the diagonal of the correlation matrix is all ones (cor(X,X) = cor(Y,Y) = 1), you can use these properties to minimize the computation required and the amount of data to be transferred, for you only need to compute the elements in the upper triangle of the matrix.
For instance, if you have two variables you have to compute only one element, for three variables you need to compute 3 elements, and so on (see the sketch below).
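A minimal sketch of that symmetry argument, assuming the global averages have already been aggregated as in the answer above (the parameter names mean, mean_sq, and mean_prod are illustrative): only the upper triangle is computed, then mirrored.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Build the full m x m correlation matrix from already-aggregated
// global means, computing only the m(m-1)/2 upper-triangle entries.
std::vector<std::vector<double>> correlation_matrix(
        const std::vector<double>& mean,                      // <X_i>
        const std::vector<double>& mean_sq,                   // <X_i^2>
        const std::vector<std::vector<double>>& mean_prod) {  // <X_i X_j>
    const std::size_t m = mean.size();
    // Diagonal entries are 1 by definition, so initialize with 1.0.
    std::vector<std::vector<double>> cor(m, std::vector<double>(m, 1.0));
    std::vector<double> sd(m);
    for (std::size_t i = 0; i < m; ++i)
        sd[i] = std::sqrt(mean_sq[i] - mean[i] * mean[i]);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = i + 1; j < m; ++j) {  // upper triangle only
            const double cov = mean_prod[i][j] - mean[i] * mean[j];
            cor[i][j] = cov / (sd[i] * sd[j]);
            cor[j][i] = cor[i][j];                 // mirror by symmetry
        }
    }
    return cor;
}
```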
Will this paper help you? https://pdfs.semanticscholar.org/f02f/0df4922351375aa304de7de296393cdf7224.pdf
"The first algorithm is a parallel version of Quadrant Correlation
(QC), and the second is a parallel version of the Maronna method.
Parallel QC uses a parallel matrix library and can handle
single-dimensional outliers in its data. The parallel Maronna method
divides the independent correlation calculations between the
processors and is capable of detecting one and two dimensional
outliers in data."
Another similar question: Distributed cross correlation matrix computation