I'm using Solr to index records consisting of binary fields. I've specified the fields in schema.xml as follows:
<field name="id" type="binary" indexed="true" stored="true" required="true" multiValued="false" />
I'm able to add records to the index via a POST request, encoding and sending the fields as Base64 strings. The size of the collection's data directory is growing, so I know it is storing something; however, when I run a match-all query (q=*:*), documents are reported as found but none are returned, e.g.:
"response": {
"numFound": 364047,
"start": 0,
"maxScore": 1,
"docs": []
}
Has anybody any idea what's causing this or how it can be resolved?
Thanks
Short answer: it cannot be solved.
The Solr reference documentation says very little about the BinaryField type:
Class: BinaryField
Description: Binary data.
As it currently stands, BinaryField is intended only for storing binary data. Nothing more, nothing less. There is an open issue to change this, but it has not attracted much attention yet.
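If BinaryField is storage-only, one possible workaround is to keep the unique key as a string field and put the raw bytes in a separate stored-only binary field. The following is a minimal sketch in Python of how such an update document could be built. The field name `payload`, the hex-encoded `id`, and both helper functions are assumptions made for illustration; they do not come from the schema in the question, and I have not verified that this alone makes the documents show up in the results.

```python
import base64
import json


def build_solr_doc(record_id: bytes, payload: bytes) -> dict:
    """Build one update document (hypothetical field layout).

    "id" is sent as a hex string so it can live in a string field,
    "payload" is Base64 text, which is how binary field values are sent.
    """
    return {
        "id": record_id.hex(),
        "payload": base64.b64encode(payload).decode("ascii"),
    }


def build_update_body(docs: list) -> str:
    """Serialize a list of documents as a JSON array for a POST to /update."""
    return json.dumps(docs)


doc = build_solr_doc(b"\x00\x01", b"hello")
body = build_update_body([doc])
```

The resulting `body` would be POSTed to the collection's update handler with `Content-Type: application/json`, the same way the Base64 fields are being sent now.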
My personal assumption is that the reason behind this is that binary data is rarely just plain binary data. Most of the time it is an elaborate file format that requires special interpretation. A separate Apache project exists for this task: Apache Tika.
Plenty of good articles and tutorials on Tika are available on the web. A good starting point for integrating it with Solr can also be found in the reference documentation (1, 2).
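To give an idea of what the Tika integration looks like from the client side, here is a small sketch that only builds the request URL for Solr's extracting request handler (`/update/extract`). It assumes the handler is enabled in solrconfig.xml; the base URL, the collection name and the helper function are hypothetical placeholders.

```python
from urllib.parse import urlencode


def build_extract_url(base_url: str, collection: str, doc_id: str,
                      commit: bool = True) -> str:
    """Build the URL for posting a file to the extracting request handler.

    literal.id sets the document's id field; commit asks Solr to commit
    after the document has been added.
    """
    params = {
        "literal.id": doc_id,
        "commit": "true" if commit else "false",
    }
    return f"{base_url.rstrip('/')}/{collection}/update/extract?{urlencode(params)}"


# Hypothetical host and collection name, for illustration only.
url = build_extract_url("http://localhost:8983/solr", "mycollection", "doc1")
```

The raw file would then be sent as the body of a POST request to that URL, and Tika extracts the text and metadata that get indexed.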