Basically, I want to remove all whitespace and tokenize the whole string as a single token. (I will apply an nGram filter on top of that later on.)
These are my index settings:
"settings": {
"index": {
"analysis": {
"filter": {
"whitespace_remove": {
"type": "pattern_replace",
"pattern": " ",
"replacement": ""
}
},
"analyzer": {
"meliuz_analyzer": {
"filter": [
"lowercase",
"whitespace_remove"
],
"type": "custom",
"tokenizer": "standard"
}
}
}
}
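
In case it's relevant, this is roughly how I intend to wire the analyzer to a field in the mapping (the field name "name" is just a placeholder, and the exact mapping syntax depends on the Elasticsearch version):

"mappings": {
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "meliuz_analyzer"
    }
  }
}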
Instead of "pattern": " ", I also tried "pattern": "\\u0020" and "pattern": "\\s", but neither made a difference.
But when I analyze the text "beleza na web", it still produces three separate tokens, "beleza", "na", and "web", instead of the single token "belezanaweb".
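
For reference, this is roughly the request I use to test it (the index name "my_index" is a placeholder; on older Elasticsearch versions _analyze takes query-string parameters instead of a JSON body):

POST /my_index/_analyze
{
  "analyzer": "meliuz_analyzer",
  "text": "beleza na web"
}

The response contains the three tokens "beleza", "na", and "web" rather than the expected single token.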