I'm trying to cluster documents I have collected as part of a research project. I am trying to use Carrot2 Workbench and can't work out how to point Carrot2 at the folder containing the documents. How do I do this, please? (I have a small number of .txt documents to compare, and they're on a standalone research machine, so I can't connect to the web and process them there.)
Any help gratefully received!
(I am trying to identify similarities/themes/groups across the documents; if Carrot2 isn't the right tool, I would be grateful for alternative suggestions!)
Many thanks,
John
Currently Carrot2 Workbench does not support clustering files directly from a local folder. There are a few solutions here:
Convert all your text files to Carrot2 XML format and cluster the XML file in Carrot2 Workbench.
Index your files in Apache Solr and query your Solr index from Carrot2 Workbench.
Convert your files to a Lucene index and query the index from Carrot2 Workbench. I wrote a simple utility for that task called folder2index (source code).
Assuming you're on Windows, the indexing process is the following:
Unzip the folder2index tool somewhere; let's assume you unzipped it to c:\carrot2\folder2index-0.0.2.
To index text files from some directory (let's assume c:\txt-input) and create the index in c:\txt-input-index, do this:
a. Open a command line console (Start menu -> Run program -> type cmd and press Enter).
b. In the console, type:
cd c:\carrot2\folder2index-0.0.2
java -jar folder2index-0.0.2.jar --index c:\txt-input-index --folders c:\txt-input --use-tika
After a short while you should see something like:
...
Index created: c:\txt-input-index
Once you've indexed the files, you can cluster them in Carrot2 Workbench using the Lucene document source. Use the content field name to refer to the content of your text file; the name of the file is stored in the fileName field.
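If you need to repeat the indexing step often, it can be scripted. The following is only a sketch in Python, not part of folder2index itself: the function names build_index_command and reindex are made up for illustration, and the only command-line flags used are the ones shown in the command above (--index, --folders, --use-tika). It deletes any existing index directory before indexing, because running the tool twice against the same index adds the files a second time.

```python
import shutil
import subprocess
from pathlib import Path


def build_index_command(jar_path, index_dir, input_dir):
    """Build the same folder2index command line that is typed by hand above."""
    return [
        "java", "-jar", str(jar_path),
        "--index", str(index_dir),
        "--folders", str(input_dir),
        "--use-tika",
    ]


def reindex(jar_path, index_dir, input_dir, run=subprocess.run):
    """Remove any existing index directory, then rebuild it from input_dir.

    `run` is injectable so the function can be exercised without Java installed.
    """
    index_dir = Path(index_dir)
    if index_dir.exists():
        # An existing index would be appended to, producing duplicate documents.
        shutil.rmtree(index_dir)
    command = build_index_command(jar_path, index_dir, input_dir)
    # check=True raises CalledProcessError if java exits with a non-zero status.
    return run(command, check=True)
```

With the paths assumed in the steps above, the call would look like reindex(r"c:\carrot2\folder2index-0.0.2\folder2index-0.0.2.jar", r"c:\txt-input-index", r"c:\txt-input").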
A couple of notes:
Currently only PDF, HTML and TXT files are indexed; other files are ignored.
If the index already exists, files are added to the index. This means that if you run the command twice with the same parameters, the index will contain duplicate documents. To re-index a folder to which you've just added some files, it's best to delete the index directory first.
You can use the Query field in Carrot2 Workbench to select specific files from the index, e.g.:
*:* -- retrieves all the content (up to the requested number of results)
mining -- retrieves all the documents that contain the word "mining" (again, up to the requested number of results)
"data mining" -- retrieves documents that contain the exact phrase "data mining"
fileName:92* -- retrieves contents of files whose names start with "92"
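For the first option mentioned at the top (converting the text files to Carrot2 XML), a small script is usually enough. The sketch below is in Python and rests on an assumption: that the Carrot2 XML input format is a searchresult root element with an optional query element and one document element per file, each holding title, url and snippet children. Please check this against the documentation of your Carrot2 version. The function name folder_to_carrot2_xml is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Control characters that are not allowed in XML 1.0 documents.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def folder_to_carrot2_xml(input_dir, output_file, query=""):
    """Write every .txt file in input_dir into one XML file; return the file count."""
    root = ET.Element("searchresult")
    ET.SubElement(root, "query").text = query

    for doc_id, path in enumerate(sorted(Path(input_dir).glob("*.txt"))):
        text = path.read_text(encoding="utf-8", errors="replace")
        text = _INVALID_XML_CHARS.sub(" ", text)

        doc = ET.SubElement(root, "document", id=str(doc_id))
        ET.SubElement(doc, "title").text = path.name
        ET.SubElement(doc, "url").text = path.resolve().as_uri()
        # Assumed layout: the whole file content goes into the snippet element.
        ET.SubElement(doc, "snippet").text = text

    ET.ElementTree(root).write(str(output_file), encoding="utf-8", xml_declaration=True)
    return len(root.findall("document"))
```

For example, folder_to_carrot2_xml(r"c:\txt-input", r"c:\txt-input.xml") would produce a single XML file that can then be opened with the XML document source in Workbench. The script assumes the text files are UTF-8 encoded; bytes that cannot be decoded are replaced rather than causing an error.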
I recently built a document clustering tool called Document Organizer. It is written in Java and is completely free. It can cluster a large collection of documents with the following extensions:
- txt
- pdf
- doc
- docx
- xls
- xlsx
- ppt
- pptx
If this software doesn't fulfill your requirements, please let me know.
Here's the link:
http://www.computergodzilla.com
If you want to read more, refer here:
http://computergodzilla.blogspot.com/2013/07/document-organizer-software.html