I'm indexing PDF attachments in Elasticsearch using the Tire gem. It all works, but I'm going to have many GB of PDFs, and we will likely store the PDFs themselves in S3 for access. Right now the base64-encoded PDFs are being stored in the Elasticsearch _source field, which will make the index huge. I want the attachments indexed but not stored, and I haven't yet figured out the right incantation to put in Tire's "mapping" block to keep them out of _source. The block currently looks like this:
mapping do
  indexes :id, :type => 'integer'
  indexes :title
  indexes :last_update, :type => 'date'
  indexes :attachment, :type => 'attachment'
end
I've tried some variations like:
indexes :attachment, :type => 'attachment', :_source => { :enabled => false }
That runs cleanly when I run the tire:import rake task, but it doesn't seem to make a difference. Does anyone know (A) whether this is possible, and (B) how to do it?
Thanks in advance.
The _source field settings accept a list of fields that should be excluded from the stored source. In Tire, I would guess that something like this should do it:
mapping :_source => { :excludes => ['attachment'] } do
  indexes :id, :type => 'integer'
  indexes :title
  indexes :last_update, :type => 'date'
  indexes :attachment, :type => 'attachment'
end
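To make the intent of that block concrete, here is a minimal sketch of the mapping body it would be expected to produce, written as a plain Ruby hash with no Tire dependency. It rests on several assumptions: that Tire passes the `:_source` option through unchanged, that untyped fields such as `title` default to `string`, and that the type is called `document` (a hypothetical name; Tire derives the real one from the model class).

```ruby
require 'json'

# Hypothetical type name, used for illustration only.
DOCUMENT_TYPE = 'document'

# Build a mapping body in which the given fields are still indexed
# but are left out of the stored _source.
def mapping_body(excluded_fields)
  {
    DOCUMENT_TYPE => {
      '_source' => { 'excludes' => excluded_fields },
      'properties' => {
        'id'          => { 'type' => 'integer' },
        'title'       => { 'type' => 'string' },   # assumed default type
        'last_update' => { 'type' => 'date' },
        'attachment'  => { 'type' => 'attachment' }
      }
    }
  }
end

puts JSON.pretty_generate(mapping_body(['attachment']))
```

Comparing this against what the index's `_mapping` endpoint actually returns is one way to check whether the `excludes` setting reached Elasticsearch at all.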
@imotov's solution does not work for me. When I run this curl command:
curl -X GET "http://localhost:9200/user_files/user_file/_search?pretty=true" -d '{"query":{"query_string":{"query":"rspec"}}}'
I can still see the content of the attachment file included in the search results.
"_source" : {"user_file":{"id":5,"folder_id":1,"updated_at":"2012-08-16T11:32:41Z","attachment_file_size":179895,"attachment_updated_at":"2012-08-16T11:32:41Z","attachment_file_name":"hw4.pdf","attachment_content_type":"application/pdf","created_at":"2012-08-16T11:32:41Z","attachment_original":"JVBERi0xL .....
Here's my implementation:
include Tire::Model::Search
include Tire::Model::Callbacks
def self.search(folder, params)
  tire.search do
    query { string params[:query], default_operator: "AND" } if params[:query].present?
    filter :term, folder_id: folder.id
    highlight :attachment_original, :options => { :tag => "<em>" }
  end
end
mapping :_source => { :excludes => ['attachment_original'] } do
  indexes :id, :type => 'integer'
  indexes :folder_id, :type => 'integer'
  indexes :attachment_file_name
  indexes :attachment_updated_at, :type => 'date'
  indexes :attachment_original, :type => 'attachment'
end
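# Serialize the record for indexing, including the base64-encoded file.
def to_indexed_json
  to_json(:methods => [:attachment_original])
end
# Base64-encode the stored file so the attachment type can parse it.
# Returns nil when the record has no attachment.
def attachment_original
  if attachment_file_name.present?
    # Read in binary mode so the PDF bytes are not altered on any platform.
    Base64.encode64(File.binread(attachment.path))
  end
end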
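One thing that may be worth ruling out, offered as an assumption rather than a confirmed diagnosis: the _source shown above is wrapped in a `user_file` root key, so the field's full path would be `user_file.attachment_original` rather than `attachment_original`. The sketch below is a simplified model of path-based exclusion, not Elasticsearch's actual implementation (wildcards are not modeled, and `filter_source` is a hypothetical helper). It only illustrates why a bare field name might fail to match a nested field.

```ruby
# Simplified model: a key is dropped only when its full dotted path
# equals one of the entries in `excludes`.
def filter_source(doc, excludes, prefix = nil)
  doc.each_with_object({}) do |(key, value), kept|
    path = [prefix, key].compact.join('.')
    next if excludes.include?(path)
    kept[key] = value.is_a?(Hash) ? filter_source(value, excludes, path) : value
  end
end

# Hypothetical, heavily trimmed version of the document shown above.
wrapped = { 'user_file' => { 'id' => 5, 'attachment_original' => 'JVBERi0x' } }

# The bare name does not match the nested path, so the field survives.
p filter_source(wrapped, ['attachment_original'])

# The full dotted path matches, so the field is dropped.
p filter_source(wrapped, ['user_file.attachment_original'])
```

If this turns out to be the cause, either excluding the full path or serializing without the root key would be the things to try. Separately, the index may need to be deleted and recreated before a changed mapping takes effect, since documents that are already indexed keep their stored _source.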