I've a large set of text documents which I will index with Solr, in a format where each line of text has associated metadata. For example:
#metadata1
A line of text.
#metadata2
Another long, broken line of
#metadata3
text that should be searchable.
I'd like to index this so that the content is searchable, including phrase matches spanning multiple lines, but the metadata is not. However, I can't discard the metadata: every match should still come back with its associated metadata.
E.g., a query for "line of text" would return 2 matches, one being the first line (with its associated metadata "metadata1") and the other being the second and third lines (with the associated "metadata2" and "metadata3" respectively).
Can anyone describe how this might be done, or reference a tutorial that would get me started?
Since Solr uses Lucene under the hood, you should start with the Lucene document model:
- An index is a collection of documents.
- A document is a sequence of fields.
- A field is a named sequence of terms.
- A term is a string.
A search runs over one or more fields and returns documents as results. Therefore, if you want span queries to match across multiple lines, you will have to put those lines into one document, but then the "line of text" query will return that single document as one result rather than two separate matches.
UPDATE: it seems to be possible to search across multiple fields using FieldMaskingSpanQuery.
Keeping the metadata lines out of the search is doable: you simply don't index them. To include the metadata in the results, I guess you want to store it while indexing and retrieve it at search time.
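To make the "one document, metadata stored alongside" idea concrete, here is a minimal sketch in plain Python rather than Solr or Lucene. It assumes a metadata line starts with '#' and tags the content line that follows it. The names parse, build_document and phrase_matches are hypothetical helpers for illustration, not part of any Solr API. The content lines are joined into one searchable string, the character span of each line is recorded, and a phrase hit is mapped back to the metadata of every line it overlaps.

```python
import re

RAW = """#metadata1
A line of text.
#metadata2
Another long, broken line of
#metadata3
text that should be searchable.
"""

def parse(raw):
    """Split raw text into (metadata, line) pairs; a '#' line tags the next content line."""
    pairs, pending = [], None
    for line in raw.splitlines():
        if line.startswith("#"):
            pending = line[1:]
        elif line.strip():
            pairs.append((pending, line.strip()))
    return pairs

def build_document(pairs):
    """Join content lines into one string and record (start, end, metadata) per line."""
    parts, spans, pos = [], [], 0
    for meta, line in pairs:
        parts.append(line)
        spans.append((pos, pos + len(line), meta))
        pos += len(line) + 1  # +1 for the joining space
    return " ".join(parts), spans

def phrase_matches(content, spans, phrase):
    """For each phrase hit, list the metadata of every line the hit overlaps."""
    pattern = re.compile(r"\s+".join(map(re.escape, phrase.split())), re.IGNORECASE)
    results = []
    for m in pattern.finditer(content):
        results.append([meta for start, end, meta in spans
                        if start < m.end() and m.start() < end])
    return results

content, spans = build_document(parse(RAW))
print(phrase_matches(content, spans, "line of text"))
```

In a real setup the recorded spans would have to be kept somewhere retrievable at search time, for instance in a stored field, and match positions would come from the search engine rather than from a regular expression; both of those are assumptions about how you wire it up.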
Configure your fieldType with a PatternReplaceCharFilterFactory that detects and removes the metadata at index-time, and make the field stored so that the metadata is returned on a match.
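As a rough illustration of what such a char filter does, the sketch below mimics the replacement step in plain Python. It is not Solr configuration: the regular expression is an assumption (a metadata line is a whole line beginning with '#'), analyzed_text is a hypothetical helper, and in Solr the equivalent Java regex would go into the filter's pattern attribute, with an empty replacement.

```python
import re

SAMPLE = """#metadata1
A line of text.
#metadata2
Another long, broken line of
#metadata3
text that should be searchable.
"""

# Assumed pattern: a metadata line is a whole line starting with '#'.
METADATA_LINE = re.compile(r"^#.*\n?", re.MULTILINE)

def analyzed_text(stored_value):
    """Mimic the index-time char filter: drop metadata lines, then join the rest."""
    stripped = METADATA_LINE.sub("", stored_value)
    return " ".join(stripped.split())

# The stored value keeps the metadata; only the analyzed text is searched.
print(analyzed_text(SAMPLE))
```

Because the char filter only affects analysis, the stored value returned with a hit is still the original text including the '#' lines. Note that this gives you the metadata of the whole field, so narrowing it down to the lines a particular match touched would still be up to the client.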