I am using Solr's DataImportHandler to import data from a database. Some of the records have empty strings if there is no value for that column.
Currently the configuration I have produces Solr documents like this:
{
"x": "value",
"y": "",
"z": 2
}
However, I would like to ignore all fields that have no value, so that documents like this are created:
{
"x": "value",
"z": 2
}
Is there something I can define in the configuration file for the DataImportHandler that will give me my desired results?
One of the lesser-known aspects of Solr is that you can plug in an UpdateRequestProcessor chain that runs after the DIH. And there are specialized URPs for exactly this problem.
So you could do something like this:
<updateRequestProcessorChain name="skip-empty">
<!-- Next two processors affect all fields - default configuration -->
<processor class="solr.TrimFieldUpdateProcessorFactory" /> <!-- Strip leading/trailing whitespace. This also turns whitespace-only values into empty strings for the next processor -->
<processor class="solr.RemoveBlankFieldUpdateProcessorFactory" /> <!-- Delete fields with no content. More efficient, and lets you query for the presence/absence of a field -->
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Remember to also reference this chain in the DIH request handler's definition:
<requestHandler name="/dataimport" class="solr.DataImportHandler">
<lst name="defaults">
....
<str name="update.chain">skip-empty</str>
</lst>
</requestHandler>
You can see the full list of UpdateRequestProcessors at http://solr-start.com.
You can either do this in SQL, as I suggested in the comment above, or, if you want a solution inside the DIH itself, use the ScriptTransformer. It lets you write a small JavaScript function that checks whether a column is an empty string and calls row.remove(fieldname) to get rid of that field completely.
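A minimal sketch of the ScriptTransformer approach might look like the following data-config. The entity name `item`, the table `my_table`, and the function name `removeEmpty` are placeholders I made up for illustration; keep your own dataSource definition and query. It assumes a JavaScript engine is available to the JVM running Solr.

```xml
<dataConfig>
  <!-- Your existing <dataSource .../> definition goes here (omitted in this sketch) -->
  <script><![CDATA[
    // Hypothetical helper: drop every column whose value is an empty string.
    function removeEmpty(row) {
      // Copy the keys first so the map is not modified while iterating over it.
      var keys = row.keySet().toArray();
      for (var i = 0; i < keys.length; i++) {
        var value = row.get(keys[i]);
        if (value != null && String(value) === '') {
          row.remove(keys[i]);
        }
      }
      // DIH expects the (possibly modified) row to be returned.
      return row;
    }
  ]]></script>
  <document>
    <!-- "item" and "my_table" are placeholder names -->
    <entity name="item"
            transformer="script:removeEmpty"
            query="SELECT x, y, z FROM my_table">
    </entity>
  </document>
</dataConfig>
```

With this in place, a row where y is an empty string should reach Solr with only x and z. Whitespace-only values are not treated as empty here; you would need to trim the value before the comparison if you want those removed as well.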
If you want to write it in pure Java instead, you can also create a reusable custom transformer for DIH.