I'm using Cassandra as my database and I need full-text search capability. I'm aware of Apache Solr, Apache Cassandra, and DSE Search.
However, I don't want to use costly, proprietary software (DSE Search). I don't want to use Apache Solr because I don't want to deal with HA, sharding, and redundancy for it. Cassandra already handles HA, sharding, and redundancy well, so I would like to store my full-text index in the existing Cassandra DB.
So what I'm looking for is something that will break down a string into its indexable parts. For example:
String input = "I like apples and bananas.";
String[] tokens = makeTokenIndex(input);
// tokens = {"I", "like", "apples", "bananas", "apple", "banana"}
Obviously I could split strings on spaces and use the words as index keys, but I'm looking for something smarter than that: something that can handle plurals, find the root of a word, and so on.
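To make the desired behavior concrete, here is a deliberately naive sketch of what such a function might do, using only the JDK. The class name `TokenIndexSketch`, the helper `naiveStem`, and the tiny `STOP_WORDS` list are hypothetical names made up for illustration, and the plural handling is just an assumption-laden stand-in for a real stemmer, not how Lucene or any other library does it.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TokenIndexSketch {
    // Tiny illustrative stop-word list (hypothetical); a real list would be much longer.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "or", "the");

    // Naive plural stripping: drops one trailing "s" from words longer than three letters.
    // This is NOT a real stemmer; it mishandles words such as "glasses" or "this".
    static String naiveStem(String word) {
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Lowercases the input, splits on anything that is not a letter or digit,
    // skips stop words, and keeps both the surface form and its naive stem.
    static List<String> makeTokenIndex(String input) {
        Set<String> tokens = new LinkedHashSet<>(); // de-duplicates, keeps insertion order
        for (String raw : input.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue;
            }
            tokens.add(raw);
            tokens.add(naiveStem(raw));
        }
        return new ArrayList<>(tokens);
    }

    public static void main(String[] args) {
        // prints [i, like, apples, apple, bananas, banana]
        System.out.println(makeTokenIndex("I like apples and bananas."));
    }
}
```

Each returned token could then serve as an index key pointing back to the source row. The hand-rolled suffix rule is exactly the part I would like to replace with something smarter.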
Would modifying Apache Lucene be the best solution for this, or is there another option?