Get top 100 most used three word phrases in all do

Posted 2019-04-14 10:49

I have about 15,000 scraped websites with their body text stored in an Elasticsearch index. I need to get the 100 most frequently used three-word phrases across all of these texts:

Something like this:

Hello there sir: 203
Big bad pony: 92
First come first: 56
[...]

I'm new to this. I looked into term vectors, but they appear to apply only to single documents. So I suspect the solution will combine term vectors and aggregations with some kind of n-gram analysis, but I have no idea how to go about implementing it. Any pointers would be helpful.

My current mapping and settings:

{
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "analyzer" : "fulltext_analyzer"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

1 Answer
Reply #2 · 2019-04-14 11:37

What you're looking for are called shingles. Shingles are like "word n-grams": sequences of consecutive terms in a string. For example, "We all live in a yellow submarine" yields the 3-word shingles "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine".
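To make the idea concrete outside Elasticsearch, here is a minimal Python sketch of 3-word shingling and counting. It is only an illustration: the function names `shingles` and `top_phrases` are made up for this example, and it assumes whitespace tokenization plus lowercasing, similar to the analyzer in the question.

```python
from collections import Counter


def shingles(text, size=3):
    """Return every run of `size` consecutive whitespace-separated words."""
    words = text.lower().split()
    # A text shorter than `size` words yields an empty list.
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def top_phrases(documents, size=3, limit=100):
    """Count shingles across all documents and return the most common ones."""
    counts = Counter()
    for doc in documents:
        counts.update(shingles(doc, size))
    return counts.most_common(limit)


print(shingles("We all live in a yellow submarine"))
```

For 15,000 documents this in-memory approach may well be enough on its own, but the shingle filter lets Elasticsearch do the same tokenization at index time.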

Take a look here: https://www.elastic.co/blog/searching-with-shingles

Basically, you need a field whose analyzer uses a shingle filter that emits only 3-term shingles. Use the configuration from the Elastic blog post, but with this filter definition:

"filter_shingle":{
   "type":"shingle",
   "max_shingle_size":3,
   "min_shingle_size":3,
   "output_unigrams":false
}
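As a rough sketch of where that filter fits, the snippet below builds the analysis settings as a Python dict and prints them as JSON. The analyzer name `shingle_analyzer` is an assumption made for this example, and the whitespace tokenizer and lowercase filter are carried over from the question rather than from the blog post, so adjust both to your setup. The `body` field in the mapping would then reference this analyzer by name.

```python
import json

# Assumed layout: "shingle_analyzer" is a hypothetical name; the tokenizer and
# lowercase filter mirror the question's existing "fulltext_analyzer".
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "filter_shingle": {
                    "type": "shingle",
                    "min_shingle_size": 3,
                    "max_shingle_size": 3,
                    # Suppress single-word tokens so only 3-word shingles remain.
                    "output_unigrams": False,
                }
            },
            "analyzer": {
                "shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "filter_shingle"],
                }
            },
        }
    }
}

# The printed JSON is what would be sent as the index-creation request body.
print(json.dumps(index_settings, indent=2))
```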

Then, after applying the shingle analyzer to the field in question (as in the blog post) and reindexing your data, you should be able to run a simple terms aggregation on your body field to see the top 100 three-word phrases:

{
  "size" : 0,
  "query" : {
    "match_all" : {}
  },
  "aggs" : {
    "three-word-phrases" : {
      "terms" : {
        "field" : "body",
        "size"  : 100  
      }
    }
  }
}
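The response to a terms aggregation lists its results under `aggregations.<name>.buckets`, each with a `key` and a `doc_count`. A small sketch of turning that into the "phrase: count" lines from the question follows; the helper name `format_buckets` is hypothetical, and the sample response is hand-written for illustration, reusing the question's example numbers rather than real output. Note that `doc_count` is the number of documents containing the phrase, not the total number of occurrences, so a phrase repeated many times in one document still counts once.

```python
def format_buckets(response, agg_name="three-word-phrases"):
    """Turn a terms-aggregation response into 'phrase: count' lines."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return ["{}: {}".format(b["key"], b["doc_count"]) for b in buckets]


# Hand-written sample shaped like a terms-aggregation response (not real data).
sample_response = {
    "aggregations": {
        "three-word-phrases": {
            "buckets": [
                {"key": "hello there sir", "doc_count": 203},
                {"key": "big bad pony", "doc_count": 92},
                {"key": "first come first", "doc_count": 56},
            ]
        }
    }
}

for line in format_buckets(sample_response):
    print(line)
```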