How to manage “paging” with Solr?

Posted 2019-02-07 14:16

Question:

I have a classifieds website. Solr handles searching the classifieds and returns ID numbers, which I put into an array. I then use this array to find the classifieds in a MySQL database whose IDs match the IDs returned by Solr.

Because this array can be very big (100,000 records or more), I need to "page" the results so that maybe 100 are returned at a time, and then use those 100 IDs in MySQL to find the classifieds.

So, is it possible to page with Solr?

And if so, how? I need example code, and an idea of what the results would look like, please.

Mostly I need a thorough example!

Thanks

Answer 1:

Take a look at IBM. Maybe that will get you on the right course.

Number of results: Specifies the maximum number of results to return.

Start: The offset to start at in the result set. This is useful for pagination.

So you probably want some variation on

<str name="rows">10</str>
<str name="start">0</str>

Your Solr client should provide some way to get the total number of results without much trouble.



Answer 2:

Paging is managed with the start and rows parameters, e.g.:

?q=something&rows=10&start=20

will give you 10 documents, starting at offset 20 (start is zero-based, so these are the 21st through 30th results).
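As a rough sketch of how a page number maps onto these two parameters, the Python helper below builds the query string for a 1-based page. The function name `page_query`, the default page size of 100, and the `fl=id` restriction (assuming the uniqueKey field is called `id`, since the question only needs IDs) are illustrative assumptions, not part of the original answer.

```python
from urllib.parse import urlencode


def page_query(q, page, per_page=100):
    """Return a Solr query string for a 1-based page number (hypothetical helper)."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    params = {
        "q": q,
        "fl": "id",                       # assumption: only the id field is needed
        "rows": per_page,                 # how many documents to return
        "start": (page - 1) * per_page,   # zero-based offset into the result set
    }
    return "?" + urlencode(params)


print(page_query("something", page=3, per_page=10))
# ?q=something&fl=id&rows=10&start=20
```

Page 3 with 10 rows per page reproduces the `rows=10&start=20` example above.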

As for getting the other information from MySQL, you're on your own. Other people and I have already suggested that you store everything in Solr to avoid the additional queries to MySQL.
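If the MySQL lookup stays, one possible shape for it is sketched below in Python: a parameterised IN query for one page of IDs, plus a step that puts the rows back into Solr's ranking order, since IN does not preserve the order of the list. The table name `classifieds`, the column name `id`, and both function names are hypothetical; `%s` is the placeholder style used by the common Python MySQL drivers, so adjust it if yours differs.

```python
def build_in_query(ids, placeholder="%s"):
    """Build a parameterised SELECT for one page of IDs (hypothetical table/column)."""
    if not ids:
        raise ValueError("ids must not be empty")
    marks = ", ".join([placeholder] * len(ids))
    sql = f"SELECT * FROM classifieds WHERE id IN ({marks})"
    return sql, list(ids)


def in_solr_order(ids, rows, key="id"):
    """Reorder database rows (dicts) to match the order of the IDs from Solr."""
    by_id = {row[key]: row for row in rows}
    # IDs missing from the database (e.g. deleted since indexing) are skipped.
    return [by_id[i] for i in ids if i in by_id]


sql, params = build_in_query([42, 7, 19])
print(sql)  # SELECT * FROM classifieds WHERE id IN (%s, %s, %s)
```

With a DB-API driver the pair would be passed as `cursor.execute(sql, params)`, so the IDs are never interpolated into the SQL string.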



Answer 3:

This is a somewhat old question with many helpful answers and recommendations, but I'll try to summarize them and describe a solution for paginating large data sets with a cursor, because I faced this issue recently.

As Yonik mentioned, the problem with the usual start/rows approach is that with a large data set and a start value far from zero, there is considerable overhead in both time and memory. Fetching 20 documents from the "middle" of 500K sorted records requires, at a minimum, sorting the whole data set (sorting internal unique IDs). If the search is distributed it is even more resource-consuming, because each shard must return its own data set (of 500,020 rows) to the aggregator node, which merges them to find the 20 applicable rows.

Solr can't compute which matching document is the 999001st result in sorted order, without first determining what the first 999000 matching sorted results are.


The solution here is to use Solr cursorMark.

On the first query you pass &cursorMark=*. This means the following:

You can think of this being analogous to start=0 as a way to tell Solr "start at the beginning of my sorted results" except that it also informs Solr that you want to use a Cursor.

Note: one caveat is that your sort clauses must include the uniqueKey field. This can be the id field if it is unique.

Part of the first query will look like this:

?sort=price desc,id asc&start=0&cursorMark=* ...

As a result you will receive the following structure:

{
    "response":{"numFound":20,"start":0,"docs":[ /* docs here */ ]},
    "nextCursorMark":"AoIIRPoAAFBX" // Here is cursor mark for next "page"
}

To retrieve the next page, the next query looks like this:

?sort=price desc,id asc&start=0&cursorMark=AoIIRPoAAFBX ...

Notice the cursorMark taken from the previous response. As a result you get the next page of results (the same structure as the first response, but with a different nextCursorMark value). And so on ...
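The request/response cycle above can be sketched as a loop in Python. This is only a minimal illustration under stated assumptions: the base URL passed in is whatever your select handler is, the field names `price` and `id` come from the example query, and `fetch_json` and `iter_ids` are hypothetical helper names. The loop stops when Solr returns the same nextCursorMark that was sent, which is how Solr signals that there are no more results.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(base_url, params):
    """Send one request to Solr and decode the JSON response."""
    with urlopen(base_url + "?" + urlencode(params)) as resp:
        return json.load(resp)


def iter_ids(base_url, q, rows=100, fetch=fetch_json):
    """Yield document IDs page by page using cursorMark."""
    cursor = "*"  # "*" means: start at the beginning of the sorted results
    while True:
        data = fetch(base_url, {
            "q": q,
            "fl": "id",
            "rows": rows,
            "wt": "json",
            "sort": "price desc,id asc",  # must include the uniqueKey field
            "cursorMark": cursor,
        })
        for doc in data["response"]["docs"]:
            yield doc["id"]
        next_cursor = data["nextCursorMark"]
        if next_cursor == cursor:  # unchanged mark: nothing left to fetch
            break
        cursor = next_cursor
```

The `fetch` argument is injectable so the loop can be exercised without a running Solr instance; in real use the default performs an HTTP GET per page.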

This approach is an ideal fit for infinite-scroll pagination, but using it for classic pagination takes some extra thought :).

Here are some reference materials I found while solving this problem; I hope they help someone get it done.

  • Pagination of results
  • Sorting, Paging, and Deep Paging in Solr (Yonik's material; thanks a lot!)
  • Efficient Cursor Based Iteration of Large Result Sets


Answer 4:

The "start" parameter controls the offset into the search results, and the "rows" parameter controls how many documents to return from there.

If you are doing "deep paging" (iterating over many pages), then you can achieve much better performance using a cursor to iterate over the result set.



Answer 5:

I think it is worth mentioning that Solr returns, together with the current page of results, a count of the total number of records found.

For example calling:

http://192.168.0.1:8983/solr/select?qt=edismax&fl=*,score&qf=content^2%20metatag.description^3%20title^5%20metatag.keywords^10&q=something&start=20&rows=10&wt=xml&version=2.2

The response is:

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
            <str name="fl">*,score</str>
            <str name="q">something</str>
            <str name="qf">content^2 metatag.description^3 title^5 metatag.keywords^10</str>
            <str name="qt">edismax</str>
            <str name="wt">xml</str>
            <str name="rows">10</str>
            <str name="start">20</str>
            <str name="version">2.2</str>
        </lst>
    </lst>
    <result name="response" numFound="1801" start="20" maxScore="0.15953878">
        <doc>...</doc>
        <doc>...</doc>
        <doc>...</doc>
...

With SolrJ, the query method returns a QueryResponse; its getResults() method returns a SolrDocumentList, which has the method getNumFound().
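For clients that read the XML directly, here is a small Python sketch of turning numFound into a page count. The `sample` string is a hand-trimmed response of the same shape as the one above (for a request with start=20), and `paging_info` is a hypothetical helper name.

```python
import math
import xml.etree.ElementTree as ET


def paging_info(xml_text, rows):
    """Extract the total hit count and derive page numbers from a Solr XML response."""
    result = ET.fromstring(xml_text).find("result[@name='response']")
    num_found = int(result.get("numFound"))
    start = int(result.get("start"))
    return {
        "num_found": num_found,
        "total_pages": math.ceil(num_found / rows),  # last page may be partial
        "current_page": start // rows + 1,           # 1-based page number
    }


sample = '<response><result name="response" numFound="1801" start="20"/></response>'
print(paging_info(sample, rows=10))
# {'num_found': 1801, 'total_pages': 181, 'current_page': 3}
```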