I'm trying to optimize my elasticsearch schema.
I have a field which is a URL - I do not want to be able to query or filter on it, just retrieve it.
My understanding is that a field that is defined as "index":"no" is not indexed, but is still stored in the index.
(see slide 5 in http://www.slideshare.net/nitin_stephens/lucene-basics)
This should correspond to Lucene's UnIndexed, right?
This confuses me. Is there a way to store some fields without them taking more storage than their content alone, and without encumbering the index for the other fields?
What am I missing?
I'm new to posting on Stack Exchange, but I believe I can help a bit!
There are a few considerations here:
Analyzing
As you don't want to do extra work, you should set "index": "no". This means the field will not be run through any tokenizers or filters. Furthermore, it will not be searchable when directing a query at the specific field: (no hits)
*here "url" is the field name.
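A field-specific query of the kind meant here could look like the sketch below. It only builds and prints the JSON search body; the URL value is made up, and the term query is just one assumed way of targeting the field in the legacy query DSL.

```python
import json

# Hypothetical search body: a term query aimed directly at the "url" field.
# The URL value is invented for illustration.
# With "index": "no" on that field there is nothing in the inverted index
# to match, so this is the query that is expected to return no hits.
field_query = {
    "query": {
        "term": {"url": "http://example.com/page"}
    }
}

print(json.dumps(field_query, indent=2))
```

The printed body would be sent to the index's `_search` endpoint.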
However, the field will still be searchable in the _all field: (might have a hit)
_all field
By default, every field gets put in the _all field. Set "include_in_all": "false" to stop that. This might not be an issue for you, as you are unlikely to search against the _all field with a URL by mistake.
I was working with a schema where countries were stored as 2-letter codes, e.g. "NO" means Norway, and it is possible someone might search against the _all field with "NO", so I made sure to set "include_in_all": "false".
Note: Any query where you don't specify a field explicitly will be executed against the _all field.
Storing
By default, elasticsearch will store your entire document (unanalyzed, as you sent it), and this will be returned to you in a hit's _source field. If you turn this off (if your elasticsearch db is getting huge, perhaps?) then you need to explicitly set "store": "yes" to store fields individually. (One thing to notice is that store takes yes or no and not true or false - it tripped me up!)
Note: if you do this, you will need to request the fields you want returned to you explicitly, e.g.:
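As a rough sketch of that setup, assuming the legacy mapping syntax used in this answer: the snippet below builds a mapping with _source disabled and "store": "yes" on the field, plus a search body that asks for the stored field by name. The type name "doc" is hypothetical, and the "fields" key is the legacy way of requesting stored fields.

```python
import json

# Assumed legacy mapping syntax, using the option names from the answer.
# "doc" is a hypothetical type name; "url" is the field from the question.
mapping = {
    "doc": {
        "_source": {"enabled": False},  # stop storing the whole document
        "properties": {
            # Not indexed, but stored individually so it can be returned.
            "url": {"type": "string", "index": "no", "store": "yes"}
        },
    }
}

# With _source disabled, stored fields have to be requested explicitly.
search_body = {
    "query": {"match_all": {}},
    "fields": ["url"],
}

print(json.dumps(mapping, indent=2))
print(json.dumps(search_body, indent=2))
```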
finally...
I would leave elasticsearch to store your whole document (as the default) and use the following mapping.
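A minimal sketch of such a mapping, assuming the legacy string-field syntax and the two options discussed above; the type name "doc" is hypothetical. The code only builds and prints the JSON mapping body.

```python
import json

# Assumed legacy mapping syntax; "doc" is a hypothetical type name.
# _source is left at its default (enabled), so no "store" setting is needed.
mapping = {
    "doc": {
        "properties": {
            "url": {
                "type": "string",
                "index": "no",            # no analysis, not searchable by field
                "include_in_all": False,  # keep the value out of the _all field
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```

The URL then comes back as part of each hit's _source without adding anything to the inverted index.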
Source: elasticsearch documentation
There are two ways to put data into the index: indexing and storing. Indexing a piece of data means that it is tokenized, placed in the inverted index, and can be searched. Storing data means it is not tokenized or analyzed, and is not added to the inverted index. It is stored in an entirely separate area, in its full-text form. It cannot be searched against, but can be retrieved, in its original form, by its document ID.
The typical Lucene query process is to query against indexed data and get back the document IDs of matching documents, then to use those document IDs to retrieve the stored data for those documents and display it to the user.
Data which is indexed, but not stored is searchable, but cannot be retrieved in its original form.
Data which is stored, but not indexed can be retrieved once you have found a hit, but is not searchable.
Data which is indexed and stored can be searched or retrieved.
Data which is neither cannot be added to the index at all.
This is covered a bit in the Lucene FAQ.
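The four combinations above can be illustrated with a toy model. This is only a sketch: the TinyIndex class and its methods are invented for illustration and are not Lucene's API or its actual data structures, and the whitespace tokenizer is a stand-in for real analysis.

```python
from collections import defaultdict


class TinyIndex:
    """Toy model of indexed vs. stored fields (hypothetical, not Lucene)."""

    def __init__(self):
        self.inverted = defaultdict(set)  # (field, token) -> set of doc ids
        self.stored = {}                  # doc id -> {field: original value}

    def add(self, doc_id, field, value, index=True, store=True):
        if not index and not store:
            # Neither indexed nor stored: nothing to add.
            raise ValueError("field must be indexed, stored, or both")
        if index:
            # Naive tokenization: lowercase and split on whitespace.
            for token in value.lower().split():
                self.inverted[(field, token)].add(doc_id)
        if store:
            self.stored.setdefault(doc_id, {})[field] = value

    def search(self, field, token):
        """Return ids of documents whose indexed field contains the token."""
        return sorted(self.inverted.get((field, token.lower()), set()))

    def retrieve(self, doc_id, field):
        """Return the stored original value, or None if it was not stored."""
        return self.stored.get(doc_id, {}).get(field)


idx = TinyIndex()
idx.add(1, "title", "Lucene basics", index=True, store=True)
idx.add(1, "body", "indexing and storing", index=True, store=False)
idx.add(1, "url", "http://example.com/lucene", index=False, store=True)

print(idx.search("title", "lucene"))  # indexed and stored: found
print(idx.retrieve(1, "body"))        # indexed only: original text not kept
print(idx.search("url", "http://example.com/lucene"))  # stored only: no match
print(idx.retrieve(1, "url"))         # stored only: still retrievable by id
```

The "url" field here behaves like the one in the question: it never touches the inverted index, but its value can be fetched once another field has produced a hit.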
You are looking for the 'index' => 'not_analyzed' mapping option. Also, if you use the _source, you do not have to specify the store => false option.