I have 60,000 documents, which I processed in gensim to get a 60000 x 300 matrix. I exported this as a CSV file. When I import it into the ELKI environment and run k-means clustering, I get the error below.
Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=266,maxdim=300 LabelList
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:81)
at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
at [...]
The error (which took me a bit to understand when I first saw it) says that your data has the "shape" variable,mindim=266,maxdim=300.
I.e., the lines do not all have the same number of columns: some have as few as 266, others have 300. This may be a file format issue, for example due to NaN, missing values, or similar bad characters.
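One way to confirm this kind of ragged input is to count the fields on every line of the exported file. The following is only a sketch: the helper name `column_count_histogram`, the comma delimiter, and the tiny in-memory sample are assumptions for illustration, not part of the original export.

```python
from collections import Counter
import csv
import io


def column_count_histogram(lines, delimiter=","):
    """Map number-of-fields -> number-of-rows for delimited text lines."""
    reader = csv.reader(lines, delimiter=delimiter)
    # Skip completely empty lines; count the fields of every other row
    return Counter(len(row) for row in reader if row)


# Tiny in-memory sample (assumed data); for a real file, pass an open
# file handle instead, e.g. open("vectors.csv", newline="")
sample = io.StringIO("1.0,2.0,3.0\n4.0,5.0\n6.0,7.0,8.0\n")
counts = column_count_histogram(sample)
print(counts)
```

If the resulting counter has more than one key, the rows differ in length, which matches the variable-dimensionality situation described in the error message.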
You get that error when you run an algorithm like k-means that assumes the data comes from an R^d vector space (that is the NumberVector,field requirement), because the input data does not meet this requirement.

This sounds strange, but I found the solution to this issue by opening the exported
CSV file, choosing Save As, and saving it again as a CSV file. While the original file is 437 MB, the second file is 163 MB. I used the numpy function np.savetxt to save the doc2vec vectors. So it seems to be a Python issue rather than an ELKI issue.

Edit: The above solution is not useful. Instead, I exported the doc2vec output, which was created using the gensim library, and explicitly set the format of the values to %1.22e while exporting, i.e. the values are written in exponential notation with 22 digits after the decimal point. Below is the entire line of code.

The CSV file thus created runs without any issue in the ELKI environment.
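As an illustration of an export with an explicit format string, here is a minimal sketch using np.savetxt. It is not the exact line from the post: the array `vectors`, the file name `doc_vectors.csv`, and the comma delimiter are assumptions, and a small random matrix stands in for the real doc2vec output.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the 60000 x 300 doc2vec matrix
vectors = np.random.default_rng(0).normal(size=(5, 300))

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical output file name
    path = os.path.join(tmp_dir, "doc_vectors.csv")

    # fmt="%1.22e" writes every value in exponential notation with
    # 22 digits after the decimal point; the delimiter is an assumption
    np.savetxt(path, vectors, fmt="%1.22e", delimiter=",")

    # Read the file back to confirm every row has the same column count
    loaded = np.loadtxt(path, delimiter=",")

print(loaded.shape)
```

Reading the file back with np.loadtxt is a quick sanity check: it raises an error if the rows do not all have the same number of columns.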