Consider the following (excel) dataset:
m | r
----|------
2.0 | 3.3
0.8 |
| 4.0
1.3 |
2.1 | 5.2
| 2.3
| 1.9
2.5 |
1.2 | 3.0
2.0 | 2.6
My goal is to fill in missing values using the following condition:
Denote as R the pairwise correlation between the above two columns (around 0.68). Denote as R* the correlation after the empty cells have been filled in. Fill in the table so that (R - R*)^2 = 0. This is, I want to keep the correlation structure of the data intact.
So far I have done it using Matlab:
clear all;
m = xlsread('data.xlsx','A2:A11') ;
r = xlsread('data.xlsx','B2:B11') ;
rho = corr(m,r,'rows','pairwise');
x0 = [1,1,1,1,1,1];
lb = [0,0,0,0,0,0];
f = @(x)my_correl(x,rho);
SOL = fmincon(f,x0,[],[],[],[],lb)
where the function my_correl
is:
function X = my_correl(x,rho)
sum_m = (11.9 + x(1) + x(2) + x(3));
sum_r = (22.3 + x(1) + x(2) + x(3));
avg_m = (11.9 + x(1) + x(2) + x(3))/8;
avg_r = (22.3 + x(4) + x(5) + x(6))/8;
rho_num = 8*(26.32 + 4*x(1) + 2.3*x(2) + 1.9*x(3) + 0.8*x(4) + 1.3*x(5) + 2.5*x(6)) - sum_m*sum_r;
rho_den = sqrt(8*(22.43 + (4*x(1))^2 + (2.3*x(2))^2 + (1.9*x(3))^2) - sum_m^2)*sqrt(8*(78.6 + (0.8*x(4))^2 + (1.3*x(5))^ + (2.5*x(6))^2) - sum_r^2);
X = (rho - rho_num/rho_den)^2;
end
This function computes the correlation manually, where every missing data is a variable x(i)
.
The problem: my actual dataset has more than 20,000 observations. There is no way I can create that rho formula manually.
How can I fill in my dataset?
Note 1: I am open to use alternative languages like Python, Julia, or R. Matlab it's just my default one.
Note 2: a 100 points bounty will be awarded to the answer. Promise from now.
This is how I would approach it, with an implementation in R provided:
There is not a unique solution for imputing the missing data points, such that the pairwise correlation of the complete (imputed) data is equal to the pairwise correlation of the incomplete data. So to find a 'good' solution rather than just 'any' solution, we can introduce an additional criteria that the complete imputed data should also share the same linear regression with the original data. This leads us to a fairly simple approach.
A solution like this in R:
As a final check, calculate linear regression of the complete imputed data and plot to show that regression line is same as for original data. Note that the plot below was for the large data set shown below, to demonstrate use of this method on large data.
DATA
original toy data from OP example:
A larger data set to demonstrate on big data