I have a CSV file which I would like to convert to a SequenceFile, which I would ultimately use to create NamedVectors to use in a clustering job. I've been using the seqdirectory command to try to make a SequenceFile, and then fed that output into seq2sparse with the -nv option to create NamedVectors. It seems like this is giving one big vector as an output, but I ultimately want each line of my CSV to become a NamedVector. Where am I going wrong?
可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
回答1:
seqdirectory
command takes every file as a document, so in reality, you only have one document, hence you only get one vector. To make it work properly you would make each line of your CSV file a file itself, where the key of the document is the name of the file and the value are its content. Nonetheless, this is quite unpractical if your corpus is large as disk reading and writing can become painfully slow.
In practice you are better off following the links I share in this comment