I have hit a real problem. I need to run k-means clustering on 5 million vectors, each with about 32 columns. I tried Mahout, but it requires Linux and I am on Windows; I am not allowed to use a Linux OS or any sort of emulator.
Can anyone suggest a k-means implementation that scales to 5M vectors and converges quickly?
I have tested a few, but they won't scale: they are slow and take forever to complete.
Thanks
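For readers at a similar scale, one technique often suggested for this kind of problem is mini-batch k-means, which updates the centres from small random samples instead of scanning all rows on every iteration. The following is only a minimal NumPy sketch under stated assumptions, not a tuned implementation: the function names `minibatch_kmeans` and `assign`, the batch size, and the iteration count are illustrative choices, and the data is assumed to fit in memory as a single array.

```python
import numpy as np


def assign(X, centers):
    """Return the index of the nearest centre for each row of X."""
    # Squared distance is ||x||^2 - 2 x.c + ||c||^2; the ||x||^2 term is the
    # same for every centre, so it can be dropped when taking the argmin.
    d = -2.0 * (X @ centers.T) + (centers ** 2).sum(axis=1)
    return d.argmin(axis=1)


def minibatch_kmeans(X, k, batch_size=10_000, n_iters=100, seed=0):
    """Mini-batch k-means sketch: each centre is kept as the running mean
    of all sampled points that have been assigned to it so far."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Assumption: initialise the centres from k distinct random rows.
    centers = X[rng.choice(n, size=k, replace=False)].astype(np.float64)
    counts = np.zeros(k, dtype=np.int64)
    for _ in range(n_iters):
        # Draw a random batch (with replacement) instead of using all rows.
        batch = X[rng.integers(0, n, size=batch_size)]
        labels = assign(batch, centers)
        for j in range(k):
            members = batch[labels == j]
            if len(members) == 0:
                continue
            counts[j] += len(members)
            # Step size shrinks as a centre accumulates more points.
            lr = len(members) / counts[j]
            centers[j] += lr * (members.mean(axis=0) - centers[j])
    return centers
```

As a rough size check, 5 million rows of 32 `float32` values occupy about 640 MB, so the array itself can fit in RAM on many machines. scikit-learn also ships a `MiniBatchKMeans` class in `sklearn.cluster` that implements the same general idea; whether either option is fast enough for a given dataset would need to be measured.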
OK, so for anyone who needs clustering on large-scale datasets: the only way I found to do it is to use Mahout. It requires a Linux platform, so I had to install VirtualBox, put Ubuntu on it, and then run Mahout there. Setting up Mahout is a lengthy procedure, but the two links I used are as follows.
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)