Near duplicate detection in Solr

2019-04-02 16:40发布

问题:

Solr is being used to search through a database of user-generated listings. These listings are imported into Solr from MySQL via the DataImportHandler.

Problem: Quite often, users report the same listing to the database, sometimes with minor changes to their listing post to avoid being easily detected as a duplicate post.

How should I implement a near-duplication detection with Solr? I do not mind having near-duplicate listings in the Solr index as long as the search results do not contain these near-duplicate listings.

I guess there are 4 possible places to do this near-duplicate detection

  1. When the user submits the listing (PHP is being used here)
  2. During the data import from MySQL to Solr
  3. After the data import from MySQL
  4. When a search is being done

What is the recommended way to do this? Thank you!

回答1:

i'm not familiar with Solr, i would implement the "near-duplication" when the user submits the listing. There are quit different algorithms to detect near-duplicates like the Jaccard Indexing.

I made a little script to see the difference between the similarity coefficients:

<?php

$input1 = "Hello there, this is a test 1, you see it's almost the same";
$input2 = "Hello there, this is a test 2, you saw it, it's almost the same";
$input3 = "this is very different from the others, but who knows ?";

echo jackard($input1, $input1) . "<br />"; // results 1

echo jackard($input1, $input2) . "<br />"; // results 0.81481481481481

echo jackard($input1, $input3) . "<br />"; // results 0.25

echo jackard($input2, $input3); // results 0.24


function jackard($a, $b){
    $a_arr = explode(" ", $a);
    $b_arr = explode(" ", $b);
    $intersect_a_b = array_intersect($a_arr,$b_arr);
    return((count($intersect_a_b)/(count($a_arr)+count($b_arr)))*2);
}
?>

You may see, that if the result is 1, it means that it's the same sentence OR it uses the same words in a different order. However, the smaller the value is, the more unique the "sentence" is. This is rather a simple implementation. You may set a limit value for example 0.4. And set the "request" in a queue if it passes this limit. And then take a look manualy at the listing. This is not "efficient". But i gave you the idea, and it's up to you to develop a more complex and automated system/algorithm. And maybe you should also take a look here.