Sparse Principal Component Analysis using sklearn


Question:

I'm trying to replicate an application from this paper, where the authors download the 20 newsgroups data and use sparse PCA (SPCA) to extract the principal components that in some sense best describe the text corpus [see section 4.1]. This is for a class project on high-dimensional data, where we were asked to pick a topic and replicate/apply it.

The output should be K principal components, each of which has a short list of words that all intuitively correspond to a certain theme (for example, the paper finds the first PC is all about politics and religion).

From my research, it seems the best way to reproduce the application from this paper is to use this algorithm: sklearn.decomposition.MiniBatchSparsePCA.

I have found only one example of how this algorithm works, here.

So my question is this: Is it, in principle, possible to follow the steps in the above linked example, using text data, to reproduce the application from section 4.1 of the paper linked in the first paragraph?

If it is, I would then be able to ask more concrete questions regarding the code.
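
To make the question concrete, here is roughly the pipeline I have in mind. This is only a minimal sketch: the use of TfidfVectorizer, the vocabulary size, the number of components K, and the alpha value are my own guesses, not the settings from the paper.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import MiniBatchSparsePCA

# Load the raw 20 newsgroups training documents
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data

# Build a document-term matrix; the vocabulary size and stop-word handling
# are illustrative choices, not necessarily the paper's preprocessing
vectorizer = TfidfVectorizer(max_features=1000, stop_words="english")
X = vectorizer.fit_transform(docs).toarray()  # MiniBatchSparsePCA expects a dense array

# Fit K sparse components; alpha controls sparsity (larger alpha -> fewer nonzero words)
K = 5
spca = MiniBatchSparsePCA(n_components=K, alpha=1.0, random_state=0)
spca.fit(X)

# For each component, print the words with the largest nonzero loadings
words = vectorizer.get_feature_names_out()  # get_feature_names() on older sklearn
for k, comp in enumerate(spca.components_):
    order = np.argsort(-np.abs(comp))
    top = [words[i] for i in order if comp[i] != 0][:10]
    print(f"Component {k}: {top}")
```

If this general shape is right, my follow-up questions would be about whether the word lists per component actually come out interpretable, and how to choose alpha and K to match the paper.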