scikit-learn: clustering text documents using DBSC

I'm tryin to use scikit-learn to cluster text documents. On the whole, I find my way around, but I have my problems with specific issues. Most of the examples I found illustrate clustering using scikit-learn with k-means as clustering algorithm. Adopting these example with k-means to my setting works in principle. However, k-means is not suitable since I don't know the number of clusters. From what I read so far -- please correct me here if needed -- DBSCAN or MeanShift seem the be more appropriate in my case. The scikit-learn website provides examples for each cluster algorithm. The problem is now, that with both DBSCAN and MeanShift I get errors I cannot comprehend, let alone solve.

My minimal code is as follows:

docs = []
for item in [database]:
    docs.append(item)

vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(docs)

X = X.todense() # <-- This line was needed to resolve the isse

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
...

(My documents are already processed, i.e., stopwords have been removed and an Porter Stemmer has been applied.)

When I run this code, I get the following error when instatiating DBSCAN and calling fit():

...
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 248, in fit
clust = dbscan(X, **self.get_params())
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 86, in dbscan
n = X.shape[0]
IndexError: tuple index out of range

Clicking on the line in dbscan_.py that throws the error, I noticed the following line

...
X = np.asarray(X)
n = X.shape[0]
...

When I use these to lines directly in my code for testing, I get the same error. I don't really know what np.asarray(X) is doing here, but after the command X.shape = (). Hence X.shape[0] bombs -- before, X.shape[0] correctly refers to the number of documents. Out of curiosity, I removed X = np.asarray(X) from dbscan_.py. When I do this, something is computing heavily. But after some seconds, I get another error:

...
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 214, in extractor
(min_indx,max_indx) = check_bounds(indices,N)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 198, in check_bounds
max_indx = indices.max()
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 17, in _amax
out=out, keepdims=keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity

In short, I have no clue how to get DBSCAN working, or what I might have missed, in general.

标签： machine-learning scikit-learn cluster-analysis data-mining dbscan

2条回答

Viruses.

2楼-- · 2020-02-21 02:15

It looks like sparse representations for DBSCAN are supported as of Jan. 2015.

I upgraded sklearn to 0.16.1 and it worked for me on text.

0人赞添加讨论(0) 举报

太酷不给撩

3楼-- · 2020-02-21 02:19

The implementation in sklearn seems to assume you are dealing with a finite vector space, and wants to find the dimensionality of your data set. Text data is commonly represented as sparse vectors, but now with the same dimensionality.

Your input data probably isn't a data matrix, but the sklearn implementations needs them to be one.

You'll need to find a different implementation. Maybe try the implementation in ELKI, which is very fast, and should not have this limitation.

You'll need to spend some time in understanding similarity first. For DBSCAN, you must choose epsilon in a way that makes sense for your data. There is no rule of thumb; this is domain specific. Therefore, you first need to figure out which similarity threshold means that two documents are similar.

Mean Shift may actually need your data to be vector space of fixed dimensionality.

0人赞添加讨论(0) 举报

scikit-learn: clustering text documents using DBSC

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间