Search Number Plates using Solr

Posted 2019-03-22 07:55

Question:

I am working on search for a large database of number plate records, and I plan to use Apache Solr to implement it. I don't know the proper term for the search feature I want, so let me explain my requirements:

When people search, I want Solr to substitute certain numbers for letters, e.g.:

12 = R

13 = B

4 = A

11 = H

and so on.

So, for example, when someone searches for "John", the results should offer suggestions like the following from the available list of number plates:

JO11 NYJ - Search should substitute 11 for H!

For example, have a look at http://www.privatenumberplates.com/list/JOHN

I am not sure how to do this in Solr; any ideas on how to get started would be great. What would be most appropriate to use: synonyms, Soundex, fuzzy search, or something else? Which analyzers / stemming libraries should be used?

Answer 1:

A series of PatternReplaceCharFilterFactory char filters to convert numbers to letters (one per conversion you need to cover), plus a phonetic filter to match similar-sounding words, could work as a starting point.
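As a rough sketch of that idea (not a tested configuration), a field type could chain one char filter per substitution and end with a phonetic filter. The field type name `plate_phonetic`, the choice of the DoubleMetaphone encoder, and the lowercase step are assumptions made for illustration; only the four substitutions come from the question.

```xml
<!-- Hypothetical field type: the name and the phonetic encoder are illustrative choices. -->
<fieldType name="plate_phonetic" class="solr.TextField">
    <!-- An <analyzer> without a type attribute is used for both indexing and querying. -->
    <analyzer>
        <!-- One char filter per number->letter substitution, applied in the order listed. -->
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="11" replacement="H" />
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="12" replacement="R" />
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="13" replacement="B" />
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="4" replacement="A" />
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <!-- inject="false" replaces each token with its phonetic code instead of adding the code alongside it. -->
        <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false" />
    </analyzer>
</fieldType>
```

The multi-digit patterns are listed before the single-digit one so that "11" is consumed as a unit; whether that ordering suits a real plate inventory would need checking against actual data.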

You should do this at both index and query time. This should work, BUT you would probably want 'john' to match 'john' with a higher score than 'jo11n', right?

So you should use copyField directives to populate several fields and match against them with different boosts: one with the original text, one with the number->letter conversion applied, one with the phonetic filter applied, etc. You can get as fancy as you need.

You might also write your own Analyzer, but I would leave that for later, in case the built-in ones are not good enough.
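A minimal sketch of such a multi-field setup might look like the following. All field names, the field types `plate_mapped` and `plate_phonetic` (which would have to be defined separately), and the boost values are hypothetical; the eDisMax `qf` parameter is one common way to apply per-field boosts, not necessarily the only one.

```xml
<!-- Schema part: three views of the same plate text. Names and types are placeholders. -->
<field name="plate"          type="text_general"   indexed="true" stored="true" />
<field name="plate_mapped"   type="plate_mapped"   indexed="true" stored="false" />
<field name="plate_phonetic" type="plate_phonetic" indexed="true" stored="false" />

<!-- copyField passes the raw source value on, so each destination runs its own analysis chain. -->
<copyField source="plate" dest="plate_mapped" />
<copyField source="plate" dest="plate_phonetic" />

<!-- solrconfig.xml part: search all three fields, weighting exact matches highest. -->
<requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
        <str name="defType">edismax</str>
        <!-- Illustrative boosts: original text > number->letter mapping > phonetic match. -->
        <str name="qf">plate^10 plate_mapped^3 plate_phonetic</str>
    </lst>
</requestHandler>
```

With boosts along these lines, a plate that literally reads "john" would be expected to outrank one that matches only after substitution or phonetic encoding; the actual values would need tuning.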



Answer 2:

I like Persimmonium's answer; I am writing to expand on it a bit. An analyzer might look like this:

<fieldType name="character_alias" class="solr.TextField">
    <analyzer>
        <charFilter class="solr.MappingCharFilterFactory" mapping="synonym_characters.txt" />
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto" />
    </analyzer>
</fieldType>

I have chosen the MappingCharFilterFactory instead of the suggested PatternReplaceCharFilterFactory because it lets you provide a list of the characters to be replaced in a single file, which is handier.

A synonym_characters.txt might look like this:

"11" => "H"
"12" => "R"
"4" => "A"

For the phonetic part I have chosen the BeiderMorseFilterFactory. Although it is designed for surnames rather than given names, it delivers rather good results when run on a small batch of samples from the site you linked:

+--+---------+----------+
|id|namePlate|score     |
+--+---------+----------+
|2 |john     |1.2513144 |
+--+---------+----------+
|3 |jo11n    |1.2513144 |
+--+---------+----------+
|4 |jon 52   |0.54745007|
+--+---------+----------+
|6 |107 jon  |0.54745007|
+--+---------+----------+
|8 |jon 52   |0.54745007|
+--+---------+----------+
|5 |40 jon   |0.4692429 |
+--+---------+----------+


Answer 3:

<fieldType name="character_alias" class="solr.TextField">
    <analyzer>
        <charFilter class="solr.MappingCharFilterFactory" mapping="synonym_characters.txt" />
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto" />
    </analyzer>
</fieldType>

Using this, we can map:

"H" => "11"
"4" => "A"
"8" => "A"

This way, however, "4" and "8" both become "A", so it also effectively maps "4" => "8". I don't know how to avoid this problem.