ELKI Kmeans clustering Task failed error for high

I have 60,000 documents, which I processed in gensim to get a 60000×300 matrix. I exported this as a CSV file. When I import it into ELKI and run k-means clustering, I get the error below.

Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=266,maxdim=300 LabelList
    at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
    at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:81)
    at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
    at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
    at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
    at [...]

Below are the ELKI settings I used.

Tags: cluster-analysis k-means gensim doc2vec elki

2 Answers

我命由我不由天

Reply #2 · 2020-05-02 03:53

The error (which took me a while to understand the first time I saw it) says that your data has the "shape"

variable,mindim=266,maxdim=300

I.e., some lines have only 266 columns while others have 300. This may be a file format issue, for example caused by NaN, missing values, or other bad characters.

You get this error when you run an algorithm such as k-means that assumes the data comes from an R^d vector space (that is the NumberVector,field requirement) and the input data does not meet this requirement.
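One way to locate the offending rows before loading the file into ELKI is to count, per row, how many fields parse as finite numbers. The following is a minimal sketch of that check, not ELKI's own parsing logic: the helper names are made up for illustration, the sample rows are invented, and ELKI's parser may tokenize lines differently from Python's csv module.

```python
import csv
import math
from collections import Counter


def numeric_width(row):
    """Return how many fields in `row` parse as finite floats."""
    count = 0
    for field in row:
        try:
            value = float(field)
        except ValueError:
            continue  # empty or non-numeric field
        if math.isfinite(value):  # NaN and inf do not count
            count += 1
    return count


def width_histogram(lines, delimiter=","):
    """Map each numeric width to the number of non-empty rows having it."""
    reader = csv.reader(lines, delimiter=delimiter)
    return Counter(numeric_width(row) for row in reader if row)


def irregular_rows(lines, expected, delimiter=","):
    """Return 1-based row numbers whose numeric width differs from `expected`."""
    reader = csv.reader(lines, delimiter=delimiter)
    return [i for i, row in enumerate(reader, start=1)
            if row and numeric_width(row) != expected]


# Invented sample standing in for the real export: rows 2 and 3 are defective.
sample = ["1.0,2.0,3.0", "4.0,nan,6.0", "7.0,,9.0", "1e-3,2,3"]
print(width_histogram(sample))
print(irregular_rows(sample, expected=3))
```

For the real file, pass an open file object instead of the sample list, for example with open(path, newline="") as f: irregular_rows(f, expected=300). If every row reports 300 numeric fields here but ELKI still sees fewer, the difference is probably in how the two parsers read the number format.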


SAY GOODBYE

Reply #3 · 2020-05-02 04:09

This sounds strange, but I found a solution by opening the exported CSV file and saving it again as CSV via Save As. The original file is 437 MB, while the re-saved file is 163 MB. I had used the numpy function np.savetxt to save the doc2vec vectors, so this seems to be a Python issue rather than an ELKI issue.

Edit: The solution above is not useful. Instead, I exported the doc2vec output created with the gensim library and set the format of the values explicitly to %1.22e, i.e. the values are written in exponential notation with 22 digits after the decimal point. Below is the complete code.

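import numpy as np
# model is the trained gensim Doc2Vec model
textVect = model.docvecs.doctag_syn0  # one vector per document
np.savetxt(r'D:\Backup\expo22.csv', textVect, delimiter=',', fmt='%1.22e')  # raw string keeps the backslashes literal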

The CSV file created this way runs in ELKI without any issue.
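Before handing such a file to ELKI, the export can be sanity-checked by reading it back. The sketch below uses random float32 data as a stand-in for the doc2vec vectors and a temporary file instead of the real path; both are assumptions for illustration only. It checks that every line has 300 comma-separated fields and that the values survive the round trip.

```python
import os
import tempfile

import numpy as np

# Stand-in for the doc2vec matrix: 5 random float32 vectors of dimension 300.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((5, 300)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.csv")

    # Same call as in the answer: explicit exponential format, comma-delimited.
    np.savetxt(path, vectors, delimiter=",", fmt="%1.22e")

    # Collect the distinct number of comma-separated fields across all lines.
    with open(path) as f:
        widths = {len(line.rstrip("\n").split(",")) for line in f}

    # Read the file back as a 2-D array.
    restored = np.loadtxt(path, delimiter=",", ndmin=2)

print(widths)
print(restored.shape)
print(np.array_equal(restored.astype(np.float32), vectors))
```

A single width in the set means the file is rectangular on the Python side. This does not prove that ELKI will parse it the same way, but it rules out rows that are genuinely short.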

