I'd like to ensure that, say, I.B.M. can be found by searching for ibm. I'd also like to make sure that Dismemberment Plan can be found by searching for dismember.
Using Solr, what tokenizer and filters can I use at index and query time to permit both kinds of results?
For I.B.M. => ibm
You would need a solr.WordDelimiterFilterFactory, which strips special characters and can catenate words and numbers.
catenateWords="1" catenates the word parts and transforms I.B.M. into IBM.
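As a rough sketch, the filter could be declared inside an analyzer like this. The attribute values are one plausible combination rather than a recommendation: with generateWordParts="1" the individual parts (I, B, M) are emitted alongside the catenated IBM, and the number-related settings are assumptions added for completeness.

```xml
<!-- Sketch only: splits tokens on punctuation and also emits the
     parts joined back together, so I.B.M. yields IBM.
     Attribute values are illustrative; tune them to your data. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="0"/>
```

This filter is usually paired with solr.WhitespaceTokenizerFactory, so that the punctuation is still present when the token reaches it.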
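For Dismemberment => dismember
You need to include a stemmer filter (e.g. solr.PorterStemFilterFactory or solr.EnglishMinimalStemFilterFactory), which indexes the roots of words and so provides matches for words that share the same root. Note that solr.EnglishMinimalStemFilterFactory only strips plurals, so a suffix like -ment calls for the more aggressive Porter stemmer.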
In addition, you can use solr.LowerCaseFilterFactory for case-insensitive matches (IBM and ibm) and solr.ASCIIFoldingFilterFactory for handling accented characters.
You can always use SynonymFilterFactory to map words that you consider synonyms.
You can apply these filters at both index and query time, so that the same conversions happen on both sides and the results are consistent.
For example field type definitions, see:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
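Putting the pieces from this answer together, a field type might look roughly like the following. This is a minimal sketch, not a tested configuration: the name text_loose is hypothetical, the filter order is one reasonable choice, and the whitespace tokenizer is an assumption made so that I.B.M. reaches the word delimiter filter as a single token.

```xml
<!-- Hypothetical field type; name and filter order are assumptions. -->
<fieldType name="text_loose" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer element with no type attribute is applied
       at both index time and query time. -->
  <analyzer>
    <!-- Split on whitespace only, leaving punctuation for the next filter. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- I.B.M. => I, B, M and the catenated IBM -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1"/>
    <!-- IBM => ibm -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Fold accented characters to their ASCII equivalents. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- Reduce words to their stems. -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Whether two particular words end up with the same stem depends on the stemmer, so it is worth running I.B.M. and Dismemberment Plan through this chain on the Analysis screen of the Solr admin UI before relying on it.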