How to use n-grams approximate matching with Solr?

We have a database of movies and series, and as the data comes from many sources of varying reliability, we'd like to be able to do fuzzy string matching on the titles of episodes. We are using Solr for search in our application, but the default matching mechanisms operate at the word level, which is not good enough for short strings such as titles.

I have used n-gram approximate matching in the past, and I was very happy to find that Lucene (and Solr) supports something like this out of the box. Unfortunately, I haven't been able to configure it correctly.

I assumed that I need a special field type for this, so I added the following field-type to my schema.xml:

<fieldType 
   name="trigrams" 
   stored="true" 
   class="solr.StrField"> 
 <analyzer type="index"> 
   <tokenizer 
       class="solr.analysis.NGramTokenizerFactory" 
       minGramSize="3" 
       maxGramSize="5" 
       /> 
   <filter class="solr.LowerCaseFilterFactory"/> 
 </analyzer> 
</fieldType>

and changed the appropriate field in the schema to:

<field name="title" type="trigrams" 
    indexed="true" stored="true" multiValued="false" />

However, this is not working as I expected. The query analysis looks correct, but I don't get any results, which makes me believe that something goes wrong at index time (i.e. the title is indexed like a default string field instead of a trigram field).

The query I am trying is something like

title:"guy walks into a psychiatrist office"

(with a typo or two) and it should match "Guy Walks into a Psychiatrist Office".

(I am not really sure if the query is correct.)

Moreover, I would actually like to go a step further: lowercase the string, remove all punctuation marks and spaces, remove English stopwords and THEN split the string into trigrams. However, the filters are applied only after the string has been tokenized...
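To make the intended order of operations concrete, here is a minimal Python sketch of that pipeline. It is not Solr configuration, only an illustration of the normalization described above; the `STOPWORDS` set and the function names are hypothetical and chosen for this example.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "into", "of", "in", "on", "and"}


def normalize(title: str) -> str:
    """Lowercase, drop punctuation and stopwords, then remove all spaces."""
    # Keep only runs of letters/digits; this discards punctuation and whitespace.
    words = re.findall(r"[a-z0-9]+", title.lower())
    # Remove stopwords while the text is still split into words, then join.
    return "".join(w for w in words if w not in STOPWORDS)


def trigrams(text: str) -> list[str]:
    """Return every overlapping 3-character window of text, in order."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


normalized = normalize("Guy Walks into a Psychiatrist Office!")
print(normalized)
print(trigrams(normalized)[:5])
```

The point of the sketch is the ordering: stopwords have to be removed while word boundaries still exist, and only afterwards is the concatenated string cut into trigrams.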

Thanks in advance for your answers.

Tags: search lucene solr approximate

2 Answers

戒情不戒烟

#2 · 2019-03-16 02:48

The solution turned out to be very simple: AND was set as the default operator, and if any of the ngrams didn't match, the whole query failed. So, it was sufficient to add:

<solrQueryParser defaultOperator="OR" />

in my schema definition.
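Why the default operator matters can be seen in a simplified model of the matching. The Python sketch below is only an illustration under stated assumptions: it uses plain trigrams rather than the 3 to 5 character grams from the schema, treats documents as sets of grams, and ignores Solr's scoring entirely. The function names are hypothetical.

```python
def trigram_set(text: str) -> set[str]:
    """Set of lowercase 3-character windows; spaces are kept as ordinary characters."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def matches_and(query_grams: set[str], doc_grams: set[str]) -> bool:
    """AND semantics: every query gram must be present in the document."""
    return query_grams <= doc_grams


def matches_or(query_grams: set[str], doc_grams: set[str]) -> bool:
    """OR semantics: one shared gram is enough to match."""
    return bool(query_grams & doc_grams)


indexed = trigram_set("Guy Walks into a Psychiatrist Office")
# Query with a typo: the second "i" of "psychiatrist" is missing.
query = trigram_set("guy walks into a psychiatrst office")

print(matches_and(query, indexed))  # False: the typo creates grams such as "rst"
print(matches_or(query, indexed))   # True: most grams still overlap
print(len(query & indexed), "of", len(query), "query trigrams found")
```

With OR, a single typo costs only the few grams that span it, so the intended title still matches, and in a real index the number of shared grams would influence ranking.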
