Solr - index JSON query string from database?

2019-09-17 05:57发布

问题:

I would like to know if it is possible to index data that contains a JSON string that can be decoded and each JSON value to be indexed with the separate values.

I am using the DIH to connect to a MySQL database and able to index the individual columns. The result would look like the following:

<response name="response" numFound="1" start="0" maxScore="2.7143538">
    ...
    <result name="response" numFound="1" start="0" maxScore="2.7143538">
        <doc>
        <float name="score">2.7143538</float>
        <str name="id">82</str>
        <str name="name">jorge</str>
        <str name="otherinfo">{"day":15,"year":1989,"month":"January"}</str>
    </doc>
</result>
</response>

The problem is that "otherinfo" is a JSON string that I would like to decode and have something like the following in my index:

<response name="response" numFound="1" start="0" maxScore="2.7143538">
    ...
    <result name="response" numFound="1" start="0" maxScore="2.7143538">
        <doc>
        <float name="score">2.7143538</float>
        <str name="id">82</str>
        <str name="name">jorge</str>
        <str name="day">15</str>
        <str name="year">1989</str>
        <str name="month">January</str>
    </doc>
</result>
</response>

Would this be possible to do at all with Solr?

Thanks in advance

回答1:

I commented on this. I decided that I should answer instead.

The fix for your issue isn't at the Solr level. You shouldn't be storing your data this way in the DB to begin with. In the long run, it would be better to fix this problem there, as opposed to trying to hack this at the Solr indexing level.

Your question proves that someone, probably an end user, is interested in searching by this data. This implies that it should probably be stored in the database as an actual Date or Timestamp field so that it can be properly selected or sorted on.

I'm sure people won't like that this doesn't exactly answer your question, but someone needs to tell you this.



回答2:

  1. If you know your way around Java you could write your own, custom transformer that would handle your specific case.

  2. Have you tried using DIH RegexTransformer to parse JSON?
    I think that should be doable, especially if you have fixed json format (doesn't contain document in document in document in ...).

  3. I've just noticed ScriptTransformer, which allows you to write your own parser. I think this is the way to go...



回答3:

Is the otherinfo field in the DB a JSON string to start with?


You would need dynamic fields (docs, explanation) and client-side code to let Solr store data with arbitary schema.

You would need to define dynamic fields in your schema like:

dyn_string_*: store text as it is
dyn_text__*: store text and index it for search

etc

Then you will need to tell DIH to map DB fields to solr dynamic fields (pseudocode warning; sorry, but I am not familiar with DIH):

Select
    day as dyn_number_day,
    name as dyn_text_name
from
    tablename

Edit

You do have the requirement to query into the data structure. This needs a schema-less datastore.

Document DBs like MongoDB offer exactly the functionality: store data on arbitary fields you determine at insert-time. And it can run any kind of ad-hoc query on your data.

I am not aware of a request handler that can index your data for that. You can write code that fetches updated (or added or removed) rows periodically, decodes the JSON field and index it to Solr.

I reccomend skinny data model to store attributes to properties independent of current DB schema. I asked a question ' Set intersection in MySQL: a clean way ' a while back.

Recap: MongoDB and friends contain exactly the functionality you need. If you want relations and referential integrity, you can keep using RDBMS. If you still want that JSON thing, develop an active system that will parse it and index it to solr. But I recommend moving to a skinny data model, since you can get the same (conditions apply!) query capabilities that Solr gives you by SQL.

Exotic technology: Graph databases like Neo4j contain document database functionality (ad-hoc queries) and relations: a relation directly links one node to another, no joins involved. So it's just one step short of referential integrity.