I am new to both Python and scikit-learn, so please bear with me.
I took this source code for the k-means clustering algorithm from "k means clustering".
I then modified it to run on my local data set by using the load_files function.
The algorithm terminates, but it does not produce any output showing which documents are clustered together.
I found that the km object has a km.labels_ array, which lists the centroid id of each document.
It also has the centroid vectors in km.cluster_centers_.
But which document is which? I have to map the labels back to dataset, which is a Bunch object.
If I print dataset.data[0], I get the contents of the first file, and I think the files are shuffled. I just want to know the file name.
I am confused by questions like: is the document at dataset.data[0] assigned to the centroid at km.labels_[0]?
My basic problem is to find which files are clustered together.
How do I find that?
Forget about the Bunch object. It's just an implementation detail to load the toy datasets that are bundled with scikit-learn.
In real life, with your real data, you just have to call directly:
km = KMeans(n_clusters).fit(my_document_features)
then collect cluster assignments from:
km.labels_
my_document_features is a 2D data structure: either a numpy array or a scipy.sparse matrix with shape (n_documents, n_features).
km.labels_ is a 1D numpy array with shape (n_documents,). Hence the first element in labels_ is the index of the cluster of the document described in the first row of the my_document_features feature matrix.
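To make the row-to-label correspondence concrete, here is a minimal sketch. The tiny feature matrix is made up for illustration (it is not from the question), and n_init and random_state are set only to keep the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature matrix: 6 samples (rows), 2 features (columns).
my_document_features = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],   # three points near the origin
    [5.0, 5.1], [5.1, 5.0], [5.0, 5.0],   # three points near (5, 5)
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(my_document_features)

# One label per row, in the same order as the rows of the feature matrix.
print(km.labels_.shape)  # (6,)
for row_index, label in enumerate(km.labels_):
    print(row_index, label)
```

The cluster ids themselves are arbitrary (which group is called 0 and which 1 can differ), but the position in labels_ always matches the row position in the feature matrix.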
Typically you would build my_document_features with a TfidfVectorizer object:
my_document_features = TfidfVectorizer().fit_transform(my_text_documents)
and my_text_documents would be either a list of Python unicode objects, if you read the documents directly (e.g. from a database, from the rows of a single CSV file, or whatever you want), or alternatively:
vec = TfidfVectorizer(input='filename')
my_document_features = vec.fit_transform(my_text_files)
where my_text_files is a Python list of the paths of your document files on your hard drive (assuming they are encoded using the UTF-8 encoding).
The length of the my_text_files or my_text_documents list should be n_documents, hence the mapping with km.labels_ is direct.
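As a rough end-to-end sketch of the filename variant, the following writes a few made-up documents to a temporary directory, vectorizes them by path, and pairs each path with its cluster label. The file names and texts are hypothetical, and n_init and random_state are only there for reproducibility.

```python
import os
import tempfile

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, written to a temporary directory for the demo.
texts = {
    "cats_1.txt": "cats purr and cats sleep",
    "cats_2.txt": "cats chase mice and purr",
    "cars_1.txt": "cars need fuel and cars need tyres",
    "cars_2.txt": "fast cars burn fuel",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    my_text_files = []
    for name, text in sorted(texts.items()):
        path = os.path.join(tmp_dir, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        my_text_files.append(path)

    # input="filename" makes the vectorizer open and read each path itself.
    vec = TfidfVectorizer(input="filename")
    my_document_features = vec.fit_transform(my_text_files)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(my_document_features)

# Row i of the feature matrix came from my_text_files[i], so zip pairs them up.
for path, label in zip(my_text_files, km.labels_):
    print(label, os.path.basename(path))
```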
As scikit-learn is not just for clustering or categorizing documents, we use the name "sample" instead of "document". This is why you will see that we use n_samples instead of n_documents to document the expected shapes of the arguments and attributes of all the estimators in the library.
dataset.filenames is the key :)
This is how I did it.
The load_files declaration is:
def load_files(container_path, description=None, categories=None,
load_content=True, shuffle=True, charset=None,
charse_error='strict', random_state=0)
so do
dataset_files = load_files("path_to_directory_containing_category_folders")
Then, when I got the result, I put the filenames into clusters, which is a dictionary:
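from collections import defaultdict

# Map each cluster id to the list of filenames assigned to it.
clusters = defaultdict(list)
for filename, label in zip(dataset_files.filenames, km.labels_):
    clusters[label].append(filename)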
and then I print it :)
for clust in clusters:
    print("\n************************\n")
    for filename in clusters[clust]:
        print(filename)
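Since the question worried about shuffling, here is a small sketch that checks whether dataset.filenames and dataset.data stay aligned when load_files shuffles. The folder and file names are made up, and the check simply re-reads each file and compares it with the entry at the same index.

```python
import os
import tempfile

from sklearn.datasets import load_files

with tempfile.TemporaryDirectory() as tmp_dir:
    # load_files expects one sub-folder per category (hypothetical names here).
    for category in ("cat_a", "cat_b"):
        os.makedirs(os.path.join(tmp_dir, category))
        for i in range(3):
            path = os.path.join(tmp_dir, category, "doc%d.txt" % i)
            with open(path, "w", encoding="utf-8") as handle:
                handle.write("%s document %d" % (category, i))

    dataset = load_files(tmp_dir, shuffle=True, random_state=0)

    # Re-read every file and compare it with dataset.data at the same index.
    aligned = True
    for filename, content in zip(dataset.filenames, dataset.data):
        with open(filename, "rb") as handle:
            aligned = aligned and (handle.read() == content)

print(aligned)
```

If this prints True, index i in dataset.data, dataset.filenames, and (after fitting on features built from dataset.data) km.labels_ all refer to the same file.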