I have the following issue:
I extracted a data set in which some of the values are either not available or missing. Each item is described by 10 parameters:
item    param1  param2  ...  param10
Item 1  1220    N/A     ...  1000
Item 2  1300    200     ...  1000
...     ...     ...     ...  ...
Item N  N/A     1000    ...  200
N is about 1500, and about half of the items are complete (no missing values).
There is an implicit logic in how the items were created, so I would like to fill in the missing values with the best possible estimates.
Example:
Let's imagine you have 2 parameters and 3 items.
item   param1  param2
item1  400     200
item2  200     100
item3  100     N/A
With a linear fit (here param2 = param1 / 2) you would easily get param2 for item3 = 50.
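For concreteness, the toy case above can be reproduced with a straight-line fit through the two complete items. This is only a sketch assuming NumPy is available; the variable names are illustrative.

```python
import numpy as np

# Complete items from the toy example: (param1, param2).
param1 = np.array([400.0, 200.0])
param2 = np.array([200.0, 100.0])

# Fit param2 = slope * param1 + intercept through the complete items.
slope, intercept = np.polyfit(param1, param2, deg=1)

# Estimate the missing param2 of item3 from its known param1 (100).
item3_param2 = slope * 100.0 + intercept
print(round(item3_param2, 6))
```

With only two complete items the line passes exactly through both of them, so this says nothing about how well a linear model would hold on the real data.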
My idea:
As I have 10 parameters and about 1500 items, I thought of doing a PCA on the covariance matrix of the 750 complete items, to find the main directions of the data set.
The PCA would give me one main direction for my items (largest eigenvalue), and secondary directions for subgroups of items (smaller eigenvalues).
I would then project the vectors with missing parameters onto the main direction, for example, to get approximate values for the missing parameters.
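A minimal sketch of what I mean, assuming NumPy and assuming missing values are encoded as NaN. The function name `pca_impute` and its `n_components` parameter are hypothetical, and centering on the mean of the complete items is my own choice. For each incomplete item, the coordinates along the leading directions are estimated by least squares from the observed parameters only, and the missing parameters are read off the reconstruction.

```python
import numpy as np

def pca_impute(X, n_components=1):
    """Fill NaNs using the leading principal directions of the complete rows."""
    X = np.asarray(X, dtype=float)
    complete = ~np.isnan(X).any(axis=1)
    Xc = X[complete]

    # Mean and covariance estimated from complete items only.
    mean = Xc.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)

    # eigh returns eigenvalues in ascending order; keep the largest ones.
    _, eigvecs = np.linalg.eigh(cov)
    V = eigvecs[:, ::-1][:, :n_components]

    filled = X.copy()
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])
        if not obs.any():
            continue  # nothing observed, nothing to project
        # Least-squares scores using only the observed coordinates.
        scores, *_ = np.linalg.lstsq(V[obs], X[i, obs] - mean[obs], rcond=None)
        recon = mean + V @ scores
        filled[i, ~obs] = recon[~obs]
    return filled

# Toy data from the example above.
toy = np.array([[400.0, 200.0],
                [200.0, 100.0],
                [100.0, np.nan]])
toy_filled = pca_impute(toy, n_components=1)
print(toy_filled)
```

If an item has fewer observed parameters than `n_components`, the least-squares step is underdetermined and `lstsq` returns the minimum-norm solution, so the number of directions kept should stay small relative to what is observed.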
From my first example:
item   param1  param2
item1  400     200
item2  200     100
item3  100     X ?
Complete matrix:
item   param1  param2
item1  400     200
item2  200     100
Covariance matrix:
1 0.5
0.5 1
Eigenvectors and eigenvalues:
V1 = (1, 1), associated with l1 = 1.5
V2 = (1, -1), associated with l2 = 0.5
Result:
If I project on V1 only, I get X = 100.
If I project on l1.V1 + l2.V2, I get X = 50. This is because there is a perfect correlation between the first 2 items.
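The arithmetic of this example can be checked with a few lines of NumPy. This sketch takes the matrix exactly as quoted above rather than recomputing it from the two complete items, and it reads "projecting" as placing item3 on the line through the origin along the chosen direction. The helper `fill_along` is hypothetical.

```python
import numpy as np

# Matrix, eigenvalues and eigenvectors as quoted in the example.
C = np.array([[1.0, 0.5],
              [0.5, 1.0]])
l1, l2 = 1.5, 0.5
V1 = np.array([1.0, 1.0])
V2 = np.array([1.0, -1.0])

# Check that (l1, V1) and (l2, V2) really are eigenpairs of C.
assert np.allclose(C @ V1, l1 * V1)
assert np.allclose(C @ V2, l2 * V2)

def fill_along(direction, param1):
    """param2 such that (param1, param2) lies on the line through the origin
    along `direction`."""
    return param1 * direction[1] / direction[0]

x_v1 = fill_along(V1, 100.0)                      # direction (1, 1)
x_combined = fill_along(l1 * V1 + l2 * V2, 100.0)  # direction (2, 1)
print(x_v1, x_combined)
```

The combined direction l1.V1 + l2.V2 equals (2, 1), which is the param1 : param2 ratio of the two complete items, and that is why it gives 50.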
So my question:
So far it's only theory, I haven't applied it yet, but before I start I would like to know if I'm going somewhere with this.
Can I do better? (I really believe yes.) What can I do if all items have one missing parameter? Where do I get the direction from?
Are there known good algorithms for filling in incomplete matrices, or can you help me complete my idea (by recommending good readings or methods)?
I think Netflix uses this kind of algorithm to fill in its movie rating matrix automatically, for example (the $1M Netflix Prize problem).
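To make the "no complete item" case concrete, here is a sketch of iterative low-rank (truncated SVD) imputation, which as far as I understand is one common family of matrix-completion methods. It does not need any complete row: missing entries start at the column means and are repeatedly replaced by their values in a low-rank approximation. The function name `iterative_svd_impute` and the parameters `rank`, `n_iter` and `tol` are hypothetical, NumPy and NaN-encoded missing values are assumed, and I have not checked how well it behaves on my real data.

```python
import numpy as np

def iterative_svd_impute(X, rank=1, n_iter=200, tol=1e-9):
    """Fill NaNs by alternating a rank-`rank` SVD fit and re-imputation."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Initial guess: column means over the observed entries.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    for _ in range(n_iter):
        # Best rank-`rank` approximation of the current filled matrix.
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]

        # Keep observed entries, update only the missing ones.
        new_filled = np.where(missing, low_rank, X)
        change = np.max(np.abs(new_filled - filled))
        filled = new_filled
        if change < tol:
            break
    return filled

# Hypothetical example: every row has exactly one missing entry.
demo = np.array([[4.0, 2.0, np.nan],
                 [2.0, np.nan, 4.0],
                 [np.nan, 0.5, 2.0],
                 [3.0, 1.5, np.nan]])
demo_filled = iterative_svd_impute(demo, rank=1)
print(demo_filled)
```

The rank plays the same role as the number of PCA directions kept, and it would presumably have to be chosen by hiding some known values and comparing the imputed values against them.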
If you believe this belongs to another stackexchange site, feel free to migrate it.