Using SOLR to calculate “similarity”/“bitcount” be

2019-02-07 08:24发布

问题:

We have a database of images where I have calculated the PHASH using Dr. Neal Krawetz's method as implemented by David Oftedal.

Part of the sample code calculates the difference between these longs is here:

ulong hash1 = AverageHash(theImage);
ulong hash2 = AverageHash(theOtherImage);

uint BitCount(ulong theNumber)
{
    uint count = 0;
    for (; theNumber > 0; theNumber >>= 8) {
        count += bitCounts[(theNumber & 0xFF)];
    }
    return count;
}

Console.WriteLine("Similarity: " + ((64 - BitCount(hash1 ^ hash2)) * 100.0) / 64.0 + "%");

The challenge is that I only know one of these hashes and I want to query SOLR to find other hashes in order of similarity.

A few notes:

  1. Using SOLR here (only alternative I have is HBASE)
  2. Want to avoid installing any custom java into solr (happy to install an existing plugin)
  3. Happy to do lots of pre-processing in C#
  4. Happy to use multiple fields to store data as a bit string, long, etc
  5. Using SOLRNet as a client

Edit, some extra information (apologies I am caught up in the problem and started assuming it was a widely known area). Here is a direct download to the C# console / sample app: http://01101001.net/Imghash.zip

An example output of this console app would be:

004143737f7f7f7f phash-test-001.jpg
0041417f7f7f7f7f phash-test-002.jpg
Similarity: 95.3125%

回答1:

You can use Solr's Fuzzy Search for this, you have to scroll down a bit on the page.

Solr's standard query parser supports fuzzy searches based on the Levenshtein Distance or Edit Distance algorithm. Fuzzy searches discover terms that are similar to a specified term without necessarily being an exact match. To perform a fuzzy search, use the tilde ~ symbol at the end of a single-word term.

Assuming you have a schema like below, where this field phash holds the phash you have calculated.

<fields>
    <!-- ... all your other fields ... -->
    <field name="phash" type="string" indexed="true" stored="true" />
</fields>

You may perform a query like

q=phash:004143737f7f7f7f~0.8&
fl=score,phash

This will return all documents that have a PHASH with a Levenshtein Distance or Edit Distance of at least 80%. You will not get the 95.3125% you have given in your question, but a 87,5% as matching/not matching characters are counted.

When you want to see that value, you may perform the following query

q=phash:004143737f7f7f7f~0.8&
fl=score,phash,strdist("0041417f7f7f7f7f", phash, edit)

This is a function call to fetch the String Distance using the Levenstein or Edit distance and will deliver a result similar to

+----------------+---------------------------------------+
|hash            |strdist("0041417f7f7f7f7f", hash, edit)|
+----------------+---------------------------------------+
|0041417f7f7f7f7f|1.0                                    |
+----------------+---------------------------------------+
|004143737f7f7f7f|0.875                                  |
+----------------+---------------------------------------+

When you want to reduce the gap between 95.3125% and 87,5% you should consider to store the PHASH not as hexadecimal value, but as octal for instance.