I have the following issue:
I extracted a set of data, but some of the values are not available or missing; for each item I identified 10 parameters:
        param1  param2  ...  param10
Item 1  1220    N/A     ...  1000
Item 2  1300    200     ...  1000
...     ...     ...     ...  ...
Item N  N/A     1000    ...  200
N ≈ 1500, and about half of the items are complete (no missing parameters).
There is an implicit logic in the creation of the items, so I would like to fill in the missing values with the best expected value possible.
Example:
Let's imagine you have 2 parameters and 3 items.
        param1  param2
item1   400     200
item2   200     100
item3   100     N/A
With linear interpolation you would easily get param2 = 50 for item3.
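A minimal sketch of that calculation in Python, assuming param2 is a linear function of param1 fitted on the two complete items (the toy numbers are from the example above):

```python
import numpy as np

# Complete items: (param1, param2) pairs from the toy example
param1 = np.array([400.0, 200.0])
param2 = np.array([200.0, 100.0])

# Fit param2 = a * param1 + b on the complete rows
a, b = np.polyfit(param1, param2, deg=1)

# Predict the missing param2 for item3 (param1 = 100)
print(a * 100 + b)  # -> 50.0
```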
My idea:
As I have 10 parameters and roughly 1500 items, I thought of doing a PCA on the covariance matrix of the ~750 items that are complete (finding the main directions of the data set).
The PCA will give me one main direction for my items (largest eigenvalue) and sub-directions for sub-groups of items (smaller eigenvalues).
I then wanted to project the vectors with missing parameters onto the main direction, for example, to get approximate values of the missing parameters (see the sketch after the worked example below).
From my first example:

        param1  param2
item1   400     200
item2   200     100
item3   100     X?

Complete matrix:

        param1  param2
item1   400     200
item2   200     100

Covariance matrix:

    1    0.5
    0.5  1
Eigenvectors and eigenvalues:

V1 = (1, 1)  associated with λ1 = 1.5
V2 = (1, -1) associated with λ2 = 0.5
Result:
If I project onto V1 only, I get X = 100.
If I project onto λ1·V1 + λ2·V2, I get X = 50.
This is because there is a perfect correlation between the two parameters across the first two items.
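A minimal sketch of this projection idea, assuming we estimate the principal directions from the complete rows and reconstruct a missing entry by a least-squares fit of the row's observed coordinates onto the leading component (the rank-1 choice and the code are illustrative, not a finished method). Because the two complete rows here are perfectly correlated, the rank-1 reconstruction recovers param2 = 50:

```python
import numpy as np

# Complete rows used to estimate the principal directions
X_complete = np.array([[400.0, 200.0],
                       [200.0, 100.0]])

mean = X_complete.mean(axis=0)
cov = np.cov(X_complete - mean, rowvar=False)

# Eigen-decomposition of the covariance, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
V = eigvecs[:, np.argsort(eigvals)[::-1]]

# Row with a missing param2: item3 = (100, ?)
row = np.array([100.0, np.nan])
obs = ~np.isnan(row)

# Least-squares score on the leading k components, observed entries only
k = 1
score, *_ = np.linalg.lstsq(V[obs, :k], (row - mean)[obs], rcond=None)

# Reconstruct the full row from the rank-k model and read off the gap
reconstructed = mean + V[:, :k] @ score
print(reconstructed[~obs])  # imputed param2 for item3 -> [50.]
```

With real data you would keep as many components as the eigenvalue spectrum justifies and solve the same least-squares problem over the observed coordinates of each incomplete row.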
So my question:
So far it's only theory; I haven't applied it yet, but before I start I would like to know whether I'm going somewhere with this.
Can I do better? (I really believe yes.) What can I do if all items have at least one missing parameter? Where do I get the direction from then?
Are there known good algorithms for filling in corrupted matrices, or can you help me complete my idea (by recommending good readings or methods)?
I think Netflix uses this kind of algorithm to fill in its film-score matrix automatically, for example (the $1M Netflix Prize problem).
If you believe this belongs on another Stack Exchange site, feel free to migrate it.
This article by Simon Funk describes his use of an approach like yours for the Netflix prize challenge; perhaps this is what you were thinking of when you mentioned it. Unlike your approach, it handles missing data. The essence is to replace a straightforward use of matrix methods to determine the singular value decomposition of the data matrix with a roughly equivalent optimization problem that more naturally accounts for missing data.
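A minimal sketch of that optimization idea, assuming Funk-style stochastic gradient descent over only the observed entries (the rank, learning rate, and regularization are illustrative, and this sweeps all entries per epoch rather than training one feature at a time as Funk did):

```python
import numpy as np

def funk_svd(X, rank=2, lr=0.005, reg=0.02, epochs=500, seed=0):
    """Factor X ~ U @ V.T using only the observed (non-NaN) entries."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = rng.normal(scale=0.1, size=(n, rank))
    V = rng.normal(scale=0.1, size=(m, rank))
    observed = list(zip(*np.where(~np.isnan(X))))
    for _ in range(epochs):
        for i, j in observed:
            err = X[i, j] - U[i] @ V[j]
            u_old = U[i].copy()
            # Gradient step on this entry's squared error, with L2 shrinkage
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * u_old - reg * V[j])
    return U @ V.T  # dense reconstruction; the gaps are the imputed values
```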
Try the NIPALS algorithm. It's the standard method from the field of chemometrics: a PCA method specifically designed for missing data. You can then back-project your scores and loadings (t*p') to fill the gaps according to the model of the data. The beauty of this approach is that you don't bias the data by imputation; you just use the data you have. Try searching for papers by Herman or Svante Wold, or there are implementations in R and Matlab. Obviously, the more data that is missing, the less reliable the results, but for data missing at random you can tolerate quite large amounts of missingness.
The legend is that Herman invented the algorithm to rank racehorses in the US - a massive missing data problem (if you think about it, not all horses meet)!
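A minimal sketch of the NIPALS iteration with missing entries, assuming we simply skip NaNs in each regression step (a one-component-at-a-time illustration, not a chemometrics-grade implementation):

```python
import numpy as np

def nipals_pca(X, n_components=1, tol=1e-8, max_iter=500):
    """NIPALS PCA that skips missing (NaN) entries in every sum."""
    X = X - np.nanmean(X, axis=0)       # column-center on available data
    n, m = X.shape
    mask = ~np.isnan(X)
    Xf = np.where(mask, X, 0.0)         # zeros stand in for NaNs in sums
    T, P = np.zeros((n, n_components)), np.zeros((m, n_components))
    for k in range(n_components):
        t = Xf[:, 0].copy()             # initial score vector
        for _ in range(max_iter):
            # Loadings: per-column regression of X on t, observed rows only
            p = (Xf.T @ t) / (mask.T @ t**2)
            p /= np.linalg.norm(p)
            # Scores: per-row regression of X on p, observed columns only
            t_new = (Xf @ p) / (mask @ p**2)
            if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
                t = t_new
                break
            t = t_new
        T[:, k], P[:, k] = t, p
        Xf -= np.where(mask, np.outer(t, p), 0.0)  # deflate observed part
    return T, P
```

The gaps are then filled by evaluating T @ P.T at the missing positions and adding the column means back, which is the t*p' back-projection described above.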
Why not use numeric prediction from machine learning? In your first example the params are attributes and the items are instances. You can try linear regression or neural networks or anything else in a couple of minutes. After training on your first example (with param2 marked as the class) you will get the equation

param2 = 0.5 * param1

which predicts param2 = 50 for item3, exactly what you want.
If you're not sure that the relations between params are linear, you can always try other types of regression (ANN, SVM, anything).
For a quick start, use Weka. Convert your data to CSV, load it into Weka, and start playing. For numeric predictions, look at the "Classify" tab.
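Outside Weka, the same regression-based fill-in takes only a few lines; a sketch assuming scikit-learn, regressing the gappy parameter on the others over the complete rows (the toy array mirrors the question's first example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Rows are items, columns are params; NaN marks the gap
X = np.array([[400.0, 200.0],
              [200.0, 100.0],
              [100.0, np.nan]])

target_col = 1                          # the param with gaps (param2)
missing = np.isnan(X[:, target_col])
complete = ~np.isnan(X).any(axis=1)

# Train on complete rows: predict the target param from the other params
features = np.delete(X, target_col, axis=1)
model = LinearRegression().fit(features[complete], X[complete, target_col])

# Fill the gaps with the model's predictions
X[missing, target_col] = model.predict(features[missing])
print(X[2, 1])  # -> 50.0
```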