I've got a rather large dataset that I would like to decompose, but it is too big to load into memory. Researching my options, it seems that sklearn's IncrementalPCA is a good choice, but I can't quite figure out how to make it work.
I can load in the data just fine:
import h5py

f = h5py.File('my_big_data.h5', 'r')  # read-only; the data stays on disk
features = f['data']
And from this example, it seems I need to decide what size chunks I want to read from it:
num_rows = features.shape[0]  # total number of rows in the dataset
chunk_size = 10               # how many rows at a time to feed ipca
Then I can create my IncrementalPCA, stream the data chunk-by-chunk, and partially fit it (also from the example above):
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=2)
for i in range(0, num_rows // chunk_size):
    ipca.partial_fit(features[i*chunk_size : (i+1)*chunk_size])
This all goes without error, but I'm not sure what to do next. How do I actually do the dimension reduction and get a new numpy array I can manipulate further and save?
EDIT
The code above was for testing on a smaller subset of my data – as @ImanolLuengo correctly points out, it would be way better to use a larger number of dimensions and chunk size in the final code.
As you guessed, the fitting is done properly, although I would suggest increasing the chunk_size to 100 or 1000 (or even higher, depending on the shape of your data).
What you have to do now is actually transform the data:
out = my_new_features_dataset  # preallocated array or HDF5 dataset of shape N x 2
for i in range(0, num_rows // chunk_size):
    out[i*chunk_size : (i+1)*chunk_size] = ipca.transform(features[i*chunk_size : (i+1)*chunk_size])
And that should give you your new transformed features. If you still have too many samples to fit in memory, I would suggest using out as another HDF5 dataset, as in the sketch below.
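For reference, here is a minimal sketch of that idea. The file and dataset names ('reduced.h5', 'pca_features') are made up, and it assumes the ipca, features, num_rows and chunk_size defined above; it also transforms the leftover rows when num_rows is not an exact multiple of chunk_size:

import h5py

with h5py.File('reduced.h5', 'w') as f_out:
    out = f_out.create_dataset('pca_features', shape=(num_rows, 2), dtype='float32')
    for start in range(0, num_rows, chunk_size):
        stop = min(start + chunk_size, num_rows)  # last chunk may be smaller
        out[start:stop] = ipca.transform(features[start:stop])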
Also, I would argue that reducing a huge dataset to 2 components is probably not a very good idea, although it is hard to say without knowing the shape of your features. I would suggest reducing them to sqrt(features.shape[1]), as it is a decent heuristic, or pro tip: use ipca.explained_variance_ratio_ to determine the best number of components for your affordable information-loss threshold.
Edit: as for explained_variance_ratio_, it returns a vector of dimension n_components (the n_components that you pass as a parameter to IPCA) where each value i indicates the percentage of the variance of your original data explained by the i-th new component.
You can follow the procedure in this answer to extract how much information is preserved by the first n components:
>>> print(ipca.explained_variance_ratio_.cumsum())
[ 0.32047581 0.59549787 0.80178824 0.932976 1. ]
Note: the numbers are fictitious, taken from the answer above and assuming that you have reduced the IPCA to 5 components. The i-th number indicates how much of the original data is explained by the first [0, i] components, as it is the cumulative sum of the explained variance ratio.
Thus, what is usually done is to fit your PCA to the same number of components as your original data:
# note: each chunk fed to partial_fit must contain at least n_components rows
ipca = IncrementalPCA(n_components=features.shape[1])
Then, after training on your whole data (with iteration + partial_fit), you can plot explained_variance_ratio_.cumsum() and choose how much data you want to lose. Or do it automatically:
import numpy as np

k = np.argmax(ipca.explained_variance_ratio_.cumsum() > 0.9) + 1
np.argmax returns the first index of the cumsum array where the value is > 0.9, so adding 1 gives the number of PCA components that preserve at least 90% of the variance of the original data.
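If you would rather eyeball it first, a minimal plotting sketch (assuming matplotlib is installed) could look like this:

import matplotlib.pyplot as plt

cumulative = ipca.explained_variance_ratio_.cumsum()
plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.axhline(0.9, color='red', linestyle='--')  # 90% reference line
plt.show()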
Then you can tweak the transformation to reflect it:
cs = chunk_size
out = my_new_features_dataset  # shape N x k
for i in range(0, num_rows // cs):
    out[i*cs : (i+1)*cs] = ipca.transform(features[i*cs : (i+1)*cs])[:, :k]
NOTE the [:, :k] slicing to select only the first k components while ignoring the rest.
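Finally, to come back to the original question of getting a NumPy array you can manipulate further and save: once the features are reduced to k components they will often fit in memory, so (as a sketch, assuming the out dataset from above and a made-up output filename) you could do:

import numpy as np

reduced = out[:]  # read the whole reduced dataset into a NumPy array
np.save('reduced_features.npy', reduced)  # or keep working with it in memory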