Solr is being used to search through a database of user-generated listings. These listings are imported into Solr from MySQL via the DataImportHandler.
Problem: Quite often, users submit the same listing again, sometimes with minor changes to the post so that it is not easily detected as a duplicate.
How should I implement near-duplicate detection with Solr? I do not mind having near-duplicate listings in the Solr index, as long as the search results do not contain them.
I guess there are 4 possible places to do this near-duplicate detection:
- When the user submits the listing (PHP is being used here)
- During the data import from MySQL to Solr
- After the data import from MySQL
- When a search is being done
What is the recommended way to do this? Thank you!
I'm not familiar with Solr, but I would implement the near-duplicate check when the user submits the listing. There are quite a few different algorithms for detecting near-duplicates, such as the Jaccard index.
I made a little script to compare the similarity coefficients:
<?php
// Naive word-overlap similarity: split on spaces and count shared words.
// Note: 2 * |A ∩ B| / (|A| + |B|) is strictly the Sørensen-Dice coefficient,
// but for this purpose it behaves much like the Jaccard index.
$input1 = "Hello there, this is a test 1, you see it's almost the same";
$input2 = "Hello there, this is a test 2, you saw it, it's almost the same";
$input3 = "this is very different from the others, but who knows ?";

echo jaccard($input1, $input1) . "<br />"; // 1
echo jaccard($input1, $input2) . "<br />"; // 0.81481481481481
echo jaccard($input1, $input3) . "<br />"; // 0.25
echo jaccard($input2, $input3);            // 0.24

function jaccard($a, $b) {
    $a_arr = explode(" ", $a);
    $b_arr = explode(" ", $b);
    // Words of $a that also appear in $b (punctuation stays attached to words).
    $intersect_a_b = array_intersect($a_arr, $b_arr);
    return (count($intersect_a_b) / (count($a_arr) + count($b_arr))) * 2;
}
?>
As you can see, a result of 1 means the two texts are identical or use exactly the same words in a different order.
The smaller the value, the more distinct the two texts are. This is a rather simple implementation. You could set a threshold, for example 0.4: if a new submission scores above it against an existing listing, put it in a queue and review it manually (see the sketch below). This is not particularly efficient, but it should give you the idea; it's up to you to develop a more complex and automated system/algorithm. And maybe you should also take a look here.
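For illustration, here is a minimal sketch of how that submit-time check could be wired up, assuming the jaccard() function above, a PDO connection $pdo to your MySQL database, and a listings table with a description column; the table, column, row limit and threshold are only illustrative assumptions:

<?php
// Sketch only: compare a new submission against recent listings and flag it
// for manual review if any similarity score passes the threshold.
// The "listings" table, "description" column and the 0.4 threshold are assumptions.
function isNearDuplicate(PDO $pdo, $newText, $threshold = 0.4) {
    // Comparing against every row would be slow; limiting to recent rows keeps
    // the check cheap, at the cost of missing older duplicates.
    $stmt = $pdo->query("SELECT description FROM listings ORDER BY id DESC LIMIT 1000");
    foreach ($stmt as $row) {
        if (jaccard($newText, $row['description']) >= $threshold) {
            return true; // send to the moderation queue instead of publishing
        }
    }
    return false;
}
?>

You would call isNearDuplicate() before inserting the new listing into MySQL; anything it flags goes into the manual-review queue described above, and everything else proceeds to the normal insert and Solr import.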