Algorithm to complete a corrupted matrix of data

Published 2019-04-27 12:08

Question:

I have the following issue:

I extracted a set of data, but some of the values are not available or missing. For each item I identified 10 parameters:

         param1   param2   ...   param10
Item 1    1220     N/A     ...    1000
Item 2    1300     200     ...    1000
...        ...     ...     ...     ...
Item N     N/A    1000     ...     200

N ~ 1500, and about half of the items are complete.

There is an implicit logic in the creation of items, so I would like to fill in these values with the best expected value possible.

Example:

Let's imagine you have 2 parameters and 3 items.

       param1  param2
item1    400    200
item2    200    100
item3    100     N/A

With linear interpolation you would easily get param2 = 50 for item3.
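For the toy example, this fit through the two complete items can be checked in a couple of lines (a minimal NumPy sketch; variable names are illustrative):

```python
import numpy as np

# The two complete items from the example above.
param1 = np.array([400.0, 200.0])
param2 = np.array([200.0, 100.0])

# Fit a line param2 = a * param1 + b through the complete items.
a, b = np.polyfit(param1, param2, 1)

# Extrapolate to the missing cell: item3 has param1 = 100.
missing = a * 100.0 + b   # approximately 50
```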

My idea:

As I have 10 parameters and ~1500 items, I thought of doing a PCA on the covariance matrix of the ~750 items that are complete (to find the main directions of the data).

The PCA will give me one main direction for my items (largest eigenvalue) and sub-directions for sub-groups of items (smaller eigenvalues).

I wanted to project the vectors with missing parameters onto the main direction, for example, to get approximate values for the missing parameters.

From my first example:

       param1  param2
item1    400    200
item2    200    100
item3    100     X ?

Complete matrix:

        param1  param2
item1    400    200
item2    200    100

Covariance matrix:

   1    0.5
   0.5  1 

Eigenvectors and eigenvalues:

V1 = (1, 1), associated with l1 = 1.5

V2 = (1, -1), associated with l2 = 0.5

Result:

If I project on V1 only, I get X = 100.

If I project on l1·V1 + l2·V2, I get X = 50. This is because there is a perfect correlation between the first two items.
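The projection step itself is easy to try numerically. Here is a minimal sketch of the idea on the toy data (it uses the raw covariance of the two complete items rather than the normalized matrix above; variable names are illustrative):

```python
import numpy as np

# Complete items (rows) with both parameters observed.
X = np.array([[400.0, 200.0],
              [200.0, 100.0]])

mu = X.mean(axis=0)               # column means of the complete block
C = np.cov(X, rowvar=False)       # covariance matrix of the columns
vals, vecs = np.linalg.eigh(C)    # eigh returns ascending eigenvalues
v1 = vecs[:, -1]                  # principal direction (largest eigenvalue)

# Item with param1 = 100 known, param2 missing: move along v1 until the
# known coordinate matches, then read off the other coordinate.
known = 100.0 - mu[0]             # centred known coordinate
t = known / v1[0]
x_imputed = mu[1] + t * v1[1]     # approximately 50 on this toy data
```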


So my question:

So far this is only theory; I haven't applied it yet, but before I start I would like to know if I'm going somewhere with this.

Can I do better? (I really believe so.) What can I do if all items have a missing parameter? Where do I get the direction from then?

Are there known, good algorithms for filling in corrupted matrices, or can you help me complete my idea (by recommending good readings or methods)?

I think Netflix uses this kind of algorithm to fill in its film-score matrix automatically, for example (the $1M Netflix Prize problem).

If you believe this belongs to another stackexchange site, feel free to migrate it.

Answer 1:

Why not use numeric prediction from machine learning? In your first example the params are attributes and the items are instances. With that framing you can try linear regression, neural networks, or anything else in a couple of minutes. After training you will get the following equation for your first example (param2 here is the class):

param2 = 0 + 1/2 * param1

which is exactly what you want.

If you're not sure that the relations between params are linear, you can always try other model types (ANN, SVM, anything).

For a quick start, use Weka: convert your data to CSV, load it into Weka, and start playing. For numeric prediction, look at the "Classify" tab.
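If Python is closer to hand than Weka, recovering the same equation is a short least-squares fit (a sketch using only the two complete items from the question's example):

```python
import numpy as np

# The two complete items: param1 is the attribute, param2 the class.
p1 = np.array([400.0, 200.0])
p2 = np.array([200.0, 100.0])

# Design matrix with an intercept column; solve by least squares.
A = np.column_stack([np.ones_like(p1), p1])
coef, *_ = np.linalg.lstsq(A, p2, rcond=None)
intercept, slope = coef   # approximately 0 and 0.5: param2 = 0 + 1/2 * param1
```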



Answer 2:

This article by Simon Funk describes his use of an approach like yours for the Netflix prize challenge; perhaps this is what you were thinking of when you mentioned it. Unlike your approach, it handles missing data. The essence is to replace a straightforward use of matrix methods to determine the singular value decomposition of the data matrix with a roughly equivalent optimization problem that more naturally accounts for missing data.
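Here is a minimal sketch of that optimization view, in the spirit of Funk's approach (not his actual code): factor the matrix as U·Vᵀ and run stochastic gradient descent over the observed entries only, so missing cells never enter the loss. The toy matrix below is the question's example scaled by 1/100 to keep the step size simple; regularization is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy matrix from the question, scaled by 1/100; NaN marks the missing cell.
M = np.array([[4.0, 2.0],
              [2.0, 1.0],
              [1.0, np.nan]])
obs = list(zip(*np.nonzero(~np.isnan(M))))   # indices of observed entries

k, lr = 1, 0.01                              # rank and learning rate
U = rng.normal(scale=0.1, size=(M.shape[0], k))
V = rng.normal(scale=0.1, size=(M.shape[1], k))

# SGD over observed entries only -- the key point of replacing the plain
# SVD with an optimization problem that tolerates missing data.
for _ in range(5000):
    for i, j in obs:
        err = M[i, j] - U[i] @ V[j]
        U[i] += lr * err * V[j]
        V[j] += lr * err * U[i]

pred = float(U[2] @ V[1])   # estimate for the missing cell (x100 gives ~50)
```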



Answer 3:

Try the NIPALS algorithm. It's the standard method in the field of chemometrics: a PCA method designed specifically for missing data. You can then back-project your scores and loadings (t*p') to fill the gaps according to the model of the data. The beauty of this approach is that you don't bias the data by imputation; you only use the data you actually have. Try searching for papers by Herman or Svante Wold, or look at the implementations in R and Matlab. Obviously, the more data is missing, the less reliable the results, but for data missing at random you can tolerate quite large amounts of missingness.
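A bare-bones sketch of the missing-data NIPALS iteration on the question's toy matrix (centering is skipped here because this toy data happens to be rank-1 through the origin; real use would center the columns first):

```python
import numpy as np

# Toy matrix; NaN marks the missing cell.
X = np.array([[400.0, 200.0],
              [200.0, 100.0],
              [100.0, np.nan]])
obs = ~np.isnan(X)
Xz = np.where(obs, X, 0.0)          # zero-filled copy for the sums below

t = Xz[:, 0].copy()                 # initialize scores with one column
for _ in range(50):
    # Loadings from observed entries only, then normalize.
    p = (Xz.T @ t) / (obs.T @ (t * t))
    p /= np.linalg.norm(p)
    # Scores from observed entries only.
    t = (Xz @ p) / (obs @ (p * p))

X_hat = np.outer(t, p)              # back-projection t*p'
filled = np.where(obs, X, X_hat)    # fill only the gaps from the model
```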

Legend has it that Herman Wold invented the algorithm to rank racehorses in the US, a massive missing-data problem (if you think about it, not all horses race each other)!