carrot2 - can I cluster documents from a folder?

Posted 2019-05-23 16:44

I'm trying to cluster documents I have collected as part of a research project. I'm using Carrot2 Workbench but can't work out how to point it at the folder containing the documents. How do I do this? (I have a small number of .txt documents to compare, and they're on a standalone research machine, so I can't connect to the web and process them there.)

Any help gratefully received!

(I am trying to identify similarities/themes/groups across the documents; if Carrot2 isn't the right tool, I'd be grateful for alternative suggestions!)

Many thanks,

John

Tags: carrot2
2 answers
Lonely孤独者°
#2 · 2019-05-23 16:58

I recently built a document clustering tool called Document Organizer. It is written in Java and is free. It can cluster large collections of documents with the following extensions:

  • txt
  • pdf
  • doc
  • docx
  • xls
  • xlsx
  • ppt
  • pptx

If this software doesn't fulfill your requirements, please let me know.

Here's the link: http://www.computergodzilla.com

If you want to read more, refer here: http://computergodzilla.blogspot.com/2013/07/document-organizer-software.html

对你真心纯属浪费
#3 · 2019-05-23 17:09

Currently Carrot2 Workbench does not support clustering files directly from a local folder. There are a few solutions here:

  1. Convert all your text files to the Carrot2 XML format and cluster the resulting XML file in Carrot2 Workbench.

  2. Index your files in Apache Solr and query your Solr index from Carrot2 Workbench.

  3. Convert your files to a Lucene index and query the index from Carrot2 Workbench. I wrote a simple utility for that task called folder2index (source code).

    Assuming you're on Windows, the indexing process is the following:

    1. Unzip the folder2index tool somewhere; let's assume you unzipped it to c:\carrot2\folder2index-0.0.2.

    2. To index text files from some directory (let's assume c:\txt-input) and create the index in c:\txt-input-index, do this:

      a. Open a command-line console (Start menu -> Run program -> type cmd and press Enter).

      b. In the console, type:

      cd c:\carrot2\folder2index-0.0.2
      java -jar folder2index-0.0.2.jar --index c:\txt-input-index --folders c:\txt-input --use-tika
      

      After a short while you should see something like:

      ...
      Index created: c:\txt-input-index
      
    3. Once you've indexed the files, you can cluster them in Carrot2 Workbench using the Lucene document source. Use the content field to refer to the content of your text file; the name of the file is stored in the fileName field.

    A couple of notes:

    • Currently only PDF, HTML and TXT files are indexed; other files are ignored.

    • If the index already exists, files are added to the index. This means that if you run the command twice with the same parameters, the index will contain duplicate documents. To re-index a folder to which you've just added some files, it's best to delete the index directory first.

    • You can use the Query field in Carrot2 Workbench to select specific files from the index, e.g.:

      *:* -- retrieves all the content (up to the requested number of results)

      mining -- retrieves all the documents that contain the word "mining" in them (again, up to the requested number of results)

      "data mining" -- retrieves documents that contain the exact phrase "data mining"

      fileName:92* -- retrieves contents of files whose names start with "92"
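
For the first option (converting the text files to Carrot2 XML), a minimal sketch in Python might look like the following. It assumes Python 3 is available on the standalone machine, and it assumes the element names of the Carrot2 3.x XML input format as I recall them (searchresult, query, document, title, url, snippet); please check these against the manual for your Workbench version. The function name folder_to_carrot2_xml and the script name are made up for illustration.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def folder_to_carrot2_xml(input_dir, output_file, query=""):
    """Write each .txt file in input_dir as one <document> element.

    Element names follow the Carrot2 3.x XML input format as assumed
    above; this is not taken from the original answer.
    Returns the number of documents written.
    """
    root = ET.Element("searchresult")
    ET.SubElement(root, "query").text = query

    count = 0
    for path in sorted(Path(input_dir).glob("*.txt")):
        # errors="replace" keeps going on files that are not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        # Drop control characters that are not allowed in XML 1.0.
        text = "".join(ch for ch in text if ch in "\t\n\r" or ord(ch) >= 0x20)

        doc = ET.SubElement(root, "document", id=str(count))
        ET.SubElement(doc, "title").text = path.name
        ET.SubElement(doc, "url").text = path.resolve().as_uri()
        ET.SubElement(doc, "snippet").text = text
        count += 1

    ET.ElementTree(root).write(output_file, encoding="utf-8", xml_declaration=True)
    return count


if __name__ == "__main__":
    if len(sys.argv) == 3:
        n = folder_to_carrot2_xml(sys.argv[1], sys.argv[2])
        print(f"Wrote {n} documents to {sys.argv[2]}")
    else:
        print("Usage: python txt2carrot2xml.py <input-folder> <output.xml>")
```

Saved as txt2carrot2xml.py, it could be run as python txt2carrot2xml.py c:\txt-input c:\txt-input.xml, and the resulting file opened with the XML document source in Workbench. The whole file content goes into snippet, which should be fine for a small number of documents; for long files you may want to split them into paragraphs first.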
