Locality-sensitive hashing - Elasticsearch

Published 2020-08-09 11:18

Question:

Is there any plugin that enables LSH on Elasticsearch? If so, could you point me to it and briefly explain how to use it? Thanks.

Edit: I found out that ES has a MinHash plugin. How can I use it to compare documents to one another? What would be a good setting for finding duplicates?

Answer 1:

  1. There is an Elasticsearch MinHash plugin. You can use it to compute a MinHash value each time you index a document, and later query documents by that value.

    1. Install MinHash plugin:

      $ $ES_HOME/bin/plugin install org.codelibs/elasticsearch-minhash/2.3.1
      
    2. Add a minhash analyzer to the index settings when creating your index:

      $ curl -XPUT 'localhost:9200/my_index' -d '{
        "index":{
          "analysis":{
            "analyzer":{
              "minhash_analyzer":{
                "type":"custom",
                "tokenizer":"standard",
                "filter":["minhash"]
              }
            }
          }
        }
      }'  
      
    3. Add a minhash_value field to the index mapping:

      $ curl -XPUT "localhost:9200/my_index/my_type/_mapping" -d '{
        "my_type":{
          "properties":{
            "message":{
              "type":"string",
              "copy_to":"minhash_value"
            },
            "minhash_value":{
              "type":"minhash",
              "minhash_analyzer":"minhash_analyzer"
            }
          }
        }
      }'
      
    4. The MinHash value is calculated automatically when you add a document to an index created with the minhash analyzer.
    5. a. Use a More Like This query to run a "like" search on the minhash_value field:

      GET /_search
      {
          "query": {
              "more_like_this" : {
                  "fields" : ["minhash_value"],
                  "like" : "KV5rsUfZpcZdVojpG8mHLA==",
                  "min_term_freq" : 1,
                  "max_query_terms" : 12
              }
          }
      }
      

      b. You can also use a fuzzy query, but it only matches values that differ from the query value by an edit distance of at most 2:

      GET /_search
      {
          "query": {
             "fuzzy" : { "minhash_value" : "KV5rsUfZpcZdVojpG8mHLA==" }
          }
      } 
      

      See the Elasticsearch documentation on the fuzzy query for more details.

  2. Alternatively, you can compute the hash value outside of Elasticsearch: write code that extracts the hash, run it each time you index a document, and attach the resulting value to that document. You can then search on the hash value with a More Like This query or a fuzzy query, as described above.
  3. Last but not least, you can write an Elasticsearch plugin yourself, along the lines of the one above, that implements the hashing algorithm that suits you, and then follow the same steps.
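
For option 2, computing the hash outside of Elasticsearch could look roughly like the sketch below. It is a minimal, generic MinHash in Python and not the plugin's algorithm: the function names, the 3-word shingles, the 128-value signature length, and the use of salted BLAKE2b as the hash family are all assumptions made for illustration, and the output is not compatible with the plugin's Base64 minhash_value. It also shows one way to compare two documents: the fraction of matching signature positions estimates their Jaccard similarity.

```python
import hashlib
import re

NUM_HASHES = 128  # signature length; an assumed value, not a plugin default


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, token):
    """64-bit hash of a token; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=NUM_HASHES, k=3):
    """Return a list with the minimum hash of the shingle set per seed."""
    tokens = shingles(text, k)
    if not tokens:
        raise ValueError("text contains no tokens")
    return [min(_seeded_hash(seed, t) for t in tokens)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of equal positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical example documents.
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
doc_c = "elasticsearch stores documents in an inverted index"

sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
sig_c = minhash_signature(doc_c)

print("a vs b:", estimated_jaccard(sig_a, sig_b))  # near-duplicates score high
print("a vs c:", estimated_jaccard(sig_a, sig_c))  # unrelated text scores low
```

For duplicate detection, you would pick a similarity threshold that fits your data and treat pairs above it as candidates. The right threshold and signature length depend on document length and on how aggressive the matching should be, so they are worth tuning on a labeled sample.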