Sklearn.KMeans : how to avoid Memory or Value Erro

2020-06-18 03:11发布

问题:

I'm working on an image classification problem and I'm creating a bag of words model. To do that, I extracted the SIFT descriptors of all my images and I have to use the KMeans algorithm to find the centers to use as my bag of words.

Here is the data I have:

Number of images: 1584
Number of SIFT descriptors (vector of 32 elements): 571685
Number of center: 15840

So I ran a KMeans algorithm to compute my centers:

dico = pickle.load(open('./dico.bin', 'rb')) # np.shape(dico) = (571685, 32)
k = np.size(os.listdir(img_path)) * 10 # = 1584 * 10

kmeans = KMeans(n_clusters=k, n_init=1, verbose=1).fit(dico)

pickle.dump(kmeans, open('./kmeans.bin', 'wb'))
pickle.dump(kmeans.cluster_centers_, open('./dico_reduit.bin', 'wb'))

With this code, I got a Memory Error because I don't have enough memory on my laptop (only 2GB) so I decided to divide by 2 the number of center and to choose randomly half of my SIFT descriptors. This time, I got Value Error : array is too big.

What can I do to get a relevant result without memory problem?

回答1:

As @sascha said in this comment, I just have to use MiniBatchKMeans class to avoid this problem:

dico = pickle.load(open('./dico.bin', 'rb'))

batch_size = np.size(os.listdir(img_path)) * 3
kmeans = MiniBatchKMeans(n_clusters=k, batch_size=batch_size, verbose=1).fit(dico)

pickle.dump(kmeans, open('./minibatchkmeans.bin', 'wb'))