Multi-level join in Solr

Hi, I have data in a three-level tree structure. Can I use a Solr join to get the root node when the user searches on a third-level node?

For example:

    PATIENT1
       -> FirstName1
       -> LastName1
       -> DOCUMENTS1_1
            -> document_type1_1
            -> document_description1_1
            -> document_value1_1
            -> CODE_ITEMS1_1_1
                -> Code_id1_1_1
                -> code1_1_1
            -> CODE_ITEMS1_1_2
                -> Code_id1_1_2
                -> code1_1_2
       -> DOCUMENTS1_2
            -> document_type1_2
            -> document_description1_2
            -> document_value1_2
            -> CODE_ITEMS1_2_1
                -> Code_id1_2_1
                -> code1_2_1
            -> CODE_ITEMS1_2_2
                -> Code_id1_2_2
                -> code1_2_2
    PATIENT2
       -> FirstName2
       -> LastName2
       -> DOCUMENTS2_1
            -> document_type2_1
            -> document_description2_1
            -> document_value2_1
            -> CODE_ITEMS2_1_1
                -> Code_id2_1_1
                -> code2_1_1
            -> CODE_ITEMS2_1_2
                -> Code_id2_1_2
                -> code2_1_2

I want to search on CODE_ITEMS and return all the patients that match the code item search criteria. How can this be done? Is it possible to apply a join twice, so that the first join gives all the documents for the code item search and the second join gives all the patients?

Something like this SQL query:

select * from patients where docID IN (select DOCID from DOCUMENTS where CODEID IN (select CODEID from CODE_ITEMS where CODE LIKE '%SEARCH_TEXT%'))

Answer 1:

I don't really know how Solr joins work internally, but knowing that multiple joins in a relational database are extremely inefficient on large data sets, I'd probably end up writing my own org.apache.solr.handler.component.QueryComponent that, after doing the normal search, fetches the root parent (of course, this approach requires that each child doc has a reference to its root patient).

If you choose to go down this path, I'll post some examples. I had a similar (more complex, ontology-related) problem in one of my previous Solr projects.

The simpler way to go (simpler when it comes to solving this problem, not the whole approach) is to completely flatten this part of your schema, store all the information (documents and code items) in the parent patient, and just do a regular search. This is more in line with Solr: you have to look at a Solr schema differently. It's nothing like a regular normalized RDB schema; Solr encourages data redundancy so that you can search blindingly fast without joins.
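To make the flattening concrete, here is a minimal Java sketch that collapses one patient subtree into a single flat field map, which is the shape you would then index as one Solr document. The class name FlattenPatient, the Doc helper, and the field names (id, first_name, last_name, document_type, code) are made up for illustration and are not from the question; list values stand in for multi-valued fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: all class and field names here are assumptions.
final class FlattenPatient {
    private FlattenPatient() {}

    /** One DOCUMENTS node together with the codes of its CODE_ITEMS children. */
    static final class Doc {
        final String type;
        final List<String> codes;

        Doc(String type, List<String> codes) {
            this.type = type;
            this.codes = codes;
        }
    }

    /** Builds one flat field map per patient; List values represent multi-valued fields. */
    static Map<String, Object> flatten(String patientId, String firstName,
                                       String lastName, List<Doc> docs) {
        List<String> types = new ArrayList<String>();
        List<String> codes = new ArrayList<String>();
        for (Doc d : docs) {
            types.add(d.type);      // every document type ends up on the patient
            codes.addAll(d.codes);  // every code ends up on the patient as well
        }
        Map<String, Object> fields = new LinkedHashMap<String, Object>();
        fields.put("id", patientId);
        fields.put("first_name", firstName);
        fields.put("last_name", lastName);
        fields.put("document_type", types);
        fields.put("code", codes);
        return fields;
    }
}
```

With this layout, a plain query on the code field returns patients directly. The trade-off is that the flat document no longer records which code belonged to which document, which only matters if you need to query on that relationship.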

A third approach would be to test joins on representative data sets and see how search performance is affected.
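For such a test, a two-level join can be expressed by nesting one {!join} inside another. The sketch below only builds the query string. It assumes each code item stores its document's ID in a docId field, each document stores its patient's ID in a patientId field, and every record has an id field. All of these names are hypothetical, and you should confirm that nested local params parse as expected on your Solr version.

```java
// Sketch only: the field names code, docId, patientId and id are assumptions.
final class NestedJoinQuery {
    private NestedJoinQuery() {}

    /** Prefixes an inner query with Solr's join local params. */
    static String join(String from, String to, String inner) {
        return "{!join from=" + from + " to=" + to + "}" + inner;
    }

    /** Code item -> document -> patient, mirroring the nested SQL subqueries. */
    static String patientsForCode(String codeText) {
        String codeItems = "code:" + codeText;              // innermost search
        String documents = join("docId", "id", codeItems);  // documents owning those code items
        return join("patientId", "id", documents);          // patients owning those documents
    }
}
```

NestedJoinQuery.patientsForCode("abc") produces {!join from=patientId to=id}{!join from=docId to=id}code:abc, which you would pass as the q parameter. The search text is not escaped here, so real user input would need escaping first.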

In the end, it really depends on your whole setup and requirements (and test results, of course).

EDIT 1: I did this a couple of years back, so you'll have to figure out whether things have changed in the meantime.

1. Create a custom request handler

To do a completely clean job, I suggest you define your own request handler (in solrconfig.xml) by simply copying the whole section that starts with

    <requestHandler name="/select" class="solr.SearchHandler"> ... ... </requestHandler>

and then changing the name to something meaningful to your users, e.g. /searchPatients. Also, add this part inside:

    <arr name="components">
        <str>patients</str>
        <str>facet</str>
        <str>mlt</str>
        <str>highlight</str>
        <str>stats</str>
        <str>debug</str>
    </arr>

2. Create a custom search component

Add this to your solrconfig.xml:

    <searchComponent name="patients" class="org.apache.solr.handler.component.PatientQueryComponent"/>

Create the PatientQueryComponent class:
The following source probably has errors, since I edited my original source in a text editor and posted it without testing, but the important thing is that you get the recipe, not finished source, right? I threw out caching, lazy loading, and the prepare method, and left only the basic logic. You'll have to see how performance is affected and then tweak the source if needed. My performance was fine, but I only had a couple of million documents in total in my index.

package org.apache.solr.handler.component;

// Assumed package names (roughly Solr/Lucene 3.x, judging by the APIs used); adjust for your version.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TermsFilter;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.SolrPluginUtils;

public class PatientQueryComponent extends SearchComponent {
...
// COMPONENT_NAME, UNLIMITED_MAX_COUNT and logger are assumed to be declared in the elided part above.

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        SolrQueryRequest req = rb.req;
        SolrQueryResponse rsp = rb.rsp;
        SolrParams params = req.getParams();
        // allow the component to be switched off per request
        if (!params.getBool(COMPONENT_NAME, true)) {
            return;
        }
        SolrIndexSearcher searcher = req.getSearcher();
        // -1 as flag if not set.
        long timeAllowed = (long) params.getInt(CommonParams.TIME_ALLOWED, -1);

        SolrIndexSearcher.QueryCommand cmd = rb.getQueryCommand();
        cmd.setTimeAllowed(timeAllowed);
        cmd.setSupersetMaxDoc(UNLIMITED_MAX_COUNT);

        // fire standard query
        SolrIndexSearcher.QueryResult result = new SolrIndexSearcher.QueryResult();
        searcher.search(result, cmd);

        DocList initialSearchList = result.getDocList();

        // Set that will hold the unique patient IDs
        Set<String> patientIds = new HashSet<String>();

        // loop through search results
        DocIterator iterator = initialSearchList.iterator();
        while (iterator.hasNext()) {
            // add your if logic (doc type, ...)
            int id = iterator.nextDoc();
            // you can try lazy field loading here and load only the patientID field value into the doc
            Document doc = searcher.doc(id);
            // field that's in the child doc and points to its root parent (the patient)
            String patientId = doc.get("patientID");
            if (patientId != null) {
                patientIds.add(patientId);
            }
        }

        // Add all unique patient IDs to a TermsFilter
        TermsFilter termsFilter = new TermsFilter();
        for (String pid : patientIds) {
            // field that's unique (by name) to the patient doc and holds the patient ID
            termsFilter.addTerm(new Term("patient_ID", pid));
        }

        // get all patients whose ID is in the TermsFilter (at most 1000 here)
        DocList patientsList = searcher.getDocList(new MatchAllDocsQuery(), searcher.convertFilter(termsFilter), null, 0, 1000);

        long totalSize = initialSearchList.size() + patientsList.size();
        logger.info("Total: " + totalSize);

        SolrDocumentList solrResultList = SolrPluginUtils.docListToSolrDocumentList(patientsList, searcher, null, null);
        SolrDocumentList solrInitialList = SolrPluginUtils.docListToSolrDocumentList(initialSearchList, searcher, null, null);

        // Add patients to the end of the list
        for (SolrDocument parent : solrResultList) {
            solrInitialList.add(parent);
        }

        // replace initial results in response
        SolrPluginUtils.addOrReplaceResults(rsp, solrInitialList);
        rsp.addToLog("hitsRef", patientsList.size());
        rb.setResult(result);
    }
}

Answer 2:

Take a look at this post: http://blog.griddynamics.com/2013/12/grandchildren-and-siblings-with-block.html

Actually, you can do this in Solr 4.5.
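Block join needs each patient, its documents, and their code items to be indexed together as one block of nested documents. Under that assumption, a query for the root patients of matching code items might look like the string built below. This is a rough sketch rather than something taken from the linked post: the type field with the value patient and the code field are hypothetical names, and the which filter is assumed to match only the root patient documents.

```java
// Sketch only: the "type:patient" filter and the "code" field are assumptions.
final class BlockJoinQuery {
    private BlockJoinQuery() {}

    /** Wraps a child-level query in Solr's block join parent parser. */
    static String parentsOf(String parentFilter, String childQuery) {
        return "{!parent which=\"" + parentFilter + "\"}" + childQuery;
    }
}
```

BlockJoinQuery.parentsOf("type:patient", "code:abc") produces {!parent which="type:patient"}code:abc. Unlike the join approach, this depends on how the data was indexed, so existing flat documents would have to be reindexed as blocks first.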