I have about 15,000 scraped websites with their body texts stored in an Elasticsearch index. I need to get the top 100 most-used three-word phrases across all of these texts, something like this:
Hello there sir: 203
Big bad pony: 92
First come first: 56
[...]
I'm new to this. I looked into term vectors, but they appear to apply only to single documents. So I suspect the solution will be some combination of term vectors and aggregations with n-gram analysis of sorts, but I have no idea how to go about implementing this. Any pointers would be helpful.
My current mapping and settings:
{
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}
What you're looking for are called shingles. Shingles are like "word n-grams": consecutive combinations of more than one term in a string (e.g. "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine").
Take a look here: https://www.elastic.co/blog/searching-with-shingles
Basically, you need a field with a shingle analyzer producing solely 3-term shingles: the configuration from the Elastic blog post, but with min_shingle_size and max_shingle_size both set to 3 and output_unigrams set to false, so that single words are not emitted alongside the shingles.
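As a rough sketch building on your existing settings (the my_shingle_filter and trigram_analyzer names and the body.shingles sub-field are just placeholders I've picked, not anything from the blog post), the index could be set up along these lines:

{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "filter": {
        "my_shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 3,
          "max_shingle_size": 3,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        },
        "trigram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "my_shingle_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer",
          "fields": {
            "shingles": {
              "type": "string",
              "analyzer": "trigram_analyzer"
            }
          }
        }
      }
    }
  }
}

Using a multi-field like this means you index body once and Elasticsearch populates both the original field and the shingled body.shingles sub-field.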
Then, after applying the shingle analyzer to the field in question (as in the blog post) and reindexing your data, you should be able to issue a query with a simple terms aggregation on the shingled body field to see the top one hundred 3-word phrases.
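Assuming a shingled sub-field named body.shingles as sketched above, the aggregation could look roughly like this:

{
  "size": 0,
  "aggs": {
    "top_3_word_phrases": {
      "terms": {
        "field": "body.shingles",
        "size": 100
      }
    }
  }
}

Bear in mind that a terms aggregation over an analyzed string field builds fielddata in heap, which can get expensive over large bodies of text; on the plus side, with your single-shard index the counts will be exact rather than approximate.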