Elasticsearch match substring in PHP

Posted 2019-04-22 15:32

Below is the code I use to build an index with Elasticsearch. The index is created successfully. I am using it to generate autosuggestions based on movie name, actor name and genre.

Now I need to match a substring against a particular field. This works if I use $params['body']['query']['wildcard']['field'] = '*sub_word*'; but only within a single word: searching for 'to' returns 'tom kruz', while searching for 'tom kr' returns no result.

So this matches only within a single word of the string. I want to match a substring that spans multiple words, i.e. 'tom kr' should return 'tom kruz'.

I found a few docs saying this is possible with 'ngram', but I don't know how to implement it in my code: I am using array-based configuration for Elasticsearch, and all the docs show the configuration in JSON format.

Please help.

require 'vendor/autoload.php';

$client = \Elasticsearch\ClientBuilder::create()
->setHosts(['http://localhost:9200'])->build();

/*************Index a document****************/
$params = ['body' => []];
$j = 1;
for ($i = 1; $i <= 100; $i++) {
    $params['body'][] = [
        'index' => [
            '_index' => 'pvrmod',
            '_type' => 'movie',
            '_id' => $i
        ]
    ];
    // Bump the suffix after every 10 documents
    if ($i % 10 == 0) {
        $j++;
    }
    $params['body'][] = [
        'title' => 'salaman khaan'.$j,
        'desc' => 'salaman khaan description'.$j,
        'gener' => 'movie gener'.$j,
        'language' => 'movie language'.$j,
        'year' => 'movie year'.$j,
        'actor' => 'movie actor'.$j,
    ];

    // Every 10 documents stop and send the bulk request
    if ($i % 10 == 0) {
        $responses = $client->bulk($params);

        // erase the old bulk request
        $params = ['body' => []];

        unset($responses);
    }
}

// Send the last batch if it exists
if (!empty($params['body'])) {
    $responses = $client->bulk($params);
}

2 answers
太酷不给撩
Reply #2 · 2019-04-22 15:56

Try building this JSON:

{
    "query": {
        "bool": {
            "should": [
                {
                    "wildcard": {
                        "field": {
                            "value": "tom*",
                            "boost": 1
                        }
                    }
                },
                {
                    "wildcard": {
                        "field": {
                            "value": "kr*",
                            "boost": 1
                        }
                    }
                }
            ]
        }
    }
}

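You can explode your search term:

$searchTerms = explode(' ', 'tom kruz');

And then create a wildcard clause for each word:

$should = [];
foreach ($searchTerms as $searchTerm) {
    // one wildcard clause per word; 'field' stands for the field you search on
    $should[] = ['wildcard' => ['field' => ['value' => $searchTerm . '*', 'boost' => 1]]];
}
// $should now holds the clauses for the "should" array of the query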
贼婆χ
Reply #3 · 2019-04-22 16:00

The problem here lies in the fact that Elasticsearch builds an inverted index. Assuming you use the standard analyser, the sentence "tom kruz is a top gun" gets split into 6 tokens: tom - kruz - is - a - top - gun. These tokens are assigned to the document (with some metadata about their position, but let's leave that aside for now).
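
To make the token boundary issue concrete, here is a rough sketch in plain PHP. It only approximates the standard analyser (lowercase, split on non-word characters) and uses PHP's fnmatch() as a stand-in for a wildcard query checked against each token. The helper names approximateTokens and anyTokenMatches are made up for illustration and are not part of any Elasticsearch API.

```php
<?php
// Rough stand-in for the standard analyser: lowercase, then split on
// non-word characters. The real analyser follows Unicode text segmentation
// rules, so treat this as an approximation only.
function approximateTokens(string $text): array
{
    return preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// A wildcard pattern is compared against each token separately,
// never against the original sentence as a whole.
function anyTokenMatches(string $pattern, array $tokens): bool
{
    foreach ($tokens as $token) {
        if (fnmatch($pattern, $token)) {
            return true;
        }
    }
    return false;
}

$tokens = approximateTokens('tom kruz is a top gun');

var_dump(anyTokenMatches('*to*', $tokens));     // 'tom' and 'top' both contain 'to'
var_dump(anyTokenMatches('*tom kr*', $tokens)); // no single token contains a space
```

The second check fails for the same reason the wildcard query in the question does: the space in 'tom kr' can never appear inside one token.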

If you want a partial match, you can have one, but only within individual tokens, not across token boundaries as you would like. Splitting your search string and building a wildcard query from the pieces, as suggested, is one option.

Another option would indeed be an ngram or edge_ngram token filter. At index time it creates those partial tokens (like t - to - tom - ... - k - kr - kru - kruz - ...) in advance, so you can just put 'tom kr' in your (match) search and it will match. Be careful though: this will bloat your index (as you can see, it stores A LOT more tokens), and you need custom analysers and probably quite a bit of knowledge about your mappings.
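
As an illustration of what the edge_ngram filter does to each token, here is a small plain-PHP simulation. It does not talk to Elasticsearch; edgeNgrams is a hypothetical helper, the gram sizes are arbitrary, and strlen/substr make it ASCII-only.

```php
<?php
// All prefixes of one token, from $minGram up to $maxGram characters.
// Byte-based string functions are used, so this sketch assumes ASCII input.
function edgeNgrams(string $token, int $minGram = 1, int $maxGram = 20): array
{
    $grams = [];
    $limit = min(strlen($token), $maxGram);
    for ($length = $minGram; $length <= $limit; $length++) {
        $grams[] = substr($token, 0, $length);
    }
    return $grams;
}

// Index side: every word of the stored title is expanded into its prefixes.
$indexed = [];
foreach (explode(' ', 'tom kruz') as $word) {
    $indexed = array_merge($indexed, edgeNgrams($word));
}

// Search side: the query words are left whole, and all of them must be
// found among the indexed prefixes.
$queryWords = explode(' ', 'tom kr');
$matches = count(array_diff($queryWords, $indexed)) === 0;

print_r($indexed);  // the prefixes stored for 'tom kruz'
var_dump($matches); // whether 'tom kr' is covered by those prefixes
```

Two words already produce seven stored tokens here, which is the index bloat the answer warns about.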

In general, the (edge_)ngram route is a good idea only for things like autocomplete, not for just any text field in your index. There are a few ways to get around your problem, but most involve building separate features to detect misspelled words and trying to suggest the right terms for them.
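
For the autocomplete case from the question, the custom analyser and mapping can be written as nested PHP arrays, since the JSON shown in the docs maps one-to-one onto them. This is a minimal sketch, not a tested setup. It assumes Elasticsearch 6.x (hence the 'movie' type level and 'text' fields; 7.x and later drop the type level in mappings) and the official elasticsearch-php client from the question. The index name 'pvrmod_ngram', the names 'autocomplete_filter' and 'autocomplete', and the gram sizes are arbitrary choices for illustration.

```php
<?php
// Index settings and mapping: a custom analyser that appends an
// edge_ngram token filter, applied to the 'title' field only.
$indexParams = [
    'index' => 'pvrmod_ngram', // hypothetical index name
    'body' => [
        'settings' => [
            'analysis' => [
                'filter' => [
                    'autocomplete_filter' => [
                        'type' => 'edge_ngram',
                        'min_gram' => 1,
                        'max_gram' => 20,
                    ],
                ],
                'analyzer' => [
                    'autocomplete' => [
                        'type' => 'custom',
                        'tokenizer' => 'standard',
                        'filter' => ['lowercase', 'autocomplete_filter'],
                    ],
                ],
            ],
        ],
        'mappings' => [
            'movie' => [ // type level: assumption for 6.x, omit on 7.x+
                'properties' => [
                    'title' => [
                        'type' => 'text',
                        'analyzer' => 'autocomplete',    // n-grams at index time
                        'search_analyzer' => 'standard', // plain words at search time
                    ],
                ],
            ],
        ],
    ],
];

// Match query: every word typed so far must be present.
$searchParams = [
    'index' => 'pvrmod_ngram',
    'body' => [
        'query' => [
            'match' => [
                'title' => ['query' => 'tom kr', 'operator' => 'and'],
            ],
        ],
    ],
];

// With a client built as in the question, the calls would be:
// $client->indices()->create($indexParams);
// ... bulk index the documents ...
// $results = $client->search($searchParams);

// Print the body to compare it with the JSON examples in the docs.
echo json_encode($indexParams['body'], JSON_PRETTY_PRINT), PHP_EOL;
```

Setting search_analyzer to standard keeps the query words whole, so 'tom kr' is looked up as the two tokens tom and kr rather than being expanded into n-grams itself.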
