Solr - count documents in the range of two date fi

Posted 2019-07-23 06:32

Question:

Here are some example Solr documents I have:

{
  "id": "1",
  "openDate": "2017-12-01T00:00:00.000Z",
  "closeDate": "2017-12-04T00:00:00.000Z"
},
{
  "id": "2",
  "openDate": "2017-12-02T00:00:00.000Z",
  "closeDate": "2017-12-04T00:00:00.000Z"
},
{
  "id": "3",
  "openDate": "2017-12-02T00:00:00.000Z",
  "closeDate": "2017-12-06T00:00:00.000Z" 
}

A document is "active" on every date from its openDate (inclusive) up to its closeDate (exclusive). I want to count the number of documents that are "active" on each day, so the output should be:

[
  {
    Date: 2017-12-01,
    count: 1
  },
  {
    Date: 2017-12-02,
    count: 3
  },
  {
    Date: 2017-12-03,
    count: 3
  },
  {
    Date: 2017-12-04,
    count: 1
  },
  {
    Date: 2017-12-05,
    count: 1
  }
]
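To make the expected counts concrete, here is a minimal brute-force sketch in plain Python (no Solr involved) that applies the inclusive-open, exclusive-close rule to the three sample documents. The helper names `parse_solr_date` and `active_counts` are made up for illustration, and the sketch assumes the timestamps are aligned to bucket boundaries, as they are in the sample data.

```python
from datetime import datetime, timedelta, timezone

# The three example documents from the question.
DOCS = [
    {"id": "1", "openDate": "2017-12-01T00:00:00.000Z", "closeDate": "2017-12-04T00:00:00.000Z"},
    {"id": "2", "openDate": "2017-12-02T00:00:00.000Z", "closeDate": "2017-12-04T00:00:00.000Z"},
    {"id": "3", "openDate": "2017-12-02T00:00:00.000Z", "closeDate": "2017-12-06T00:00:00.000Z"},
]


def parse_solr_date(value):
    """Parse a UTC timestamp written like 2017-12-01T00:00:00.000Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def active_counts(docs, start, end, step=timedelta(days=1)):
    """Count documents active at each bucket start in [start, end).

    A document is active when openDate <= bucket start < closeDate.
    Buckets with no active document are still emitted, with count 0.
    """
    results = []
    current = start
    while current < end:
        count = sum(
            1
            for doc in docs
            if parse_solr_date(doc["openDate"]) <= current < parse_solr_date(doc["closeDate"])
        )
        results.append({"Date": current.date().isoformat(), "count": count})
        current += step
    return results


counts = active_counts(
    DOCS,
    parse_solr_date("2017-12-01T00:00:00.000Z"),
    parse_solr_date("2017-12-07T00:00:00.000Z"),
)
for row in counts:
    print(row)
```

Extending the window to 2017-12-07 adds a 2017-12-06 row with a count of zero, which is the kind of "hole" the question asks about further down.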

One easy way to solve this is to keep a multi-valued date field (say, openDates) holding every date in the range of interest, so the documents are expanded like this:

  {
    "id": "1",
    "openDate": "2017-12-01T00:00:00.000Z",
    "closeDate": "2017-12-04T00:00:00.000Z",
    "openDates": ["2017-12-01T00:00:00.000Z",
                  "2017-12-02T00:00:00.000Z",
                  "2017-12-03T00:00:00.000Z"]
  },
  {
    "id": "2",
    "openDate": "2017-12-02T00:00:00.000Z",
    "closeDate": "2017-12-04T00:00:00.000Z",
    "openDates": ["2017-12-02T00:00:00.000Z",
                  "2017-12-03T00:00:00.000Z"]    
  },
  {
    "id": "3",
    "openDate": "2017-12-02T00:00:00.000Z",
    "closeDate": "2017-12-06T00:00:00.000Z",
    "openDates": ["2017-12-02T00:00:00.000Z",
                  "2017-12-03T00:00:00.000Z",
                  "2017-12-04T00:00:00.000Z",
                  "2017-12-05T00:00:00.000Z"]    
  }
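The expansion above could be done at indexing time with something like the following sketch. It is plain Python, not a Solr API; `expand_open_dates` and `SOLR_FORMAT` are hypothetical names, and the step size is a parameter so the growth at finer granularity is easy to see.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the sample documents (millisecond precision, UTC).
SOLR_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def format_solr_date(moment):
    """Format a datetime back into the millisecond layout used above."""
    millis = moment.microsecond // 1000
    return moment.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


def expand_open_dates(doc, step=timedelta(days=1)):
    """Return a copy of doc with an openDates list covering [openDate, closeDate)."""
    current = datetime.strptime(doc["openDate"], SOLR_FORMAT)
    close = datetime.strptime(doc["closeDate"], SOLR_FORMAT)
    dates = []
    while current < close:  # closeDate itself is excluded
        dates.append(format_solr_date(current))
        current += step
    return {**doc, "openDates": dates}


doc1 = {
    "id": "1",
    "openDate": "2017-12-01T00:00:00.000Z",
    "closeDate": "2017-12-04T00:00:00.000Z",
}
daily = expand_open_dates(doc1)
hourly = expand_open_dates(doc1, step=timedelta(hours=1))
print(daily["openDates"])
print(len(hourly["openDates"]))
```

For document 1 the daily expansion holds three values, while the hourly one already holds 72, which illustrates why this approach scales poorly to finer buckets.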

Then I can run a facet query like this:

/select?q=*:*&facet=true&facet.field=openDates&rows=0

to get the counts I need.

Is there a better way to solve this in Solr?

Ideally, an alternative approach would also allow bucketing by hour or minute, not just by day. The approach above produces a very large multi-valued field at finer granularity. Also, is there a good way to fill the holes (i.e. missing dates) with zero counts?

Answer 1:

DateRangeField comes to the rescue here. In the schema, add something like this:

<fieldType name="range_date" class="solr.DateRangeField" />
<field name="active" type="range_date" indexed="true" stored="false"/>

You can specify the active range like this:

doc1.addField("active", "[2017-12-01T00:00:00.000Z TO 2017-12-04T00:00:00.000Z]")

and later request range facets on this field.
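As a small illustration, the range string can be built from the existing openDate and closeDate values before indexing. This is only a string-building sketch: `active_range` is a hypothetical helper, and it simply mirrors the bracket syntax shown in the answer.

```python
def active_range(doc):
    """Build the value for the 'active' field from openDate and closeDate."""
    return f"[{doc['openDate']} TO {doc['closeDate']}]"


doc = {
    "id": "1",
    "openDate": "2017-12-01T00:00:00.000Z",
    "closeDate": "2017-12-04T00:00:00.000Z",
}
range_value = active_range(doc)
print(range_value)
```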

Example parameters with 1-day granularity (change the gap parameter for other bucket sizes):

      q.add("facet", "true")
      q.add("facet.range", "active")
      q.add("facet.range.start", "NOW/MONTH")
      q.add("facet.range.end", "NOW/MONTH+1MONTH")
      q.add("facet.range.include", "outer")
      q.add("facet.range.gap", "+1DAY")
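If you send the request over plain HTTP, as with the /select URL in the question, the same parameters have to be URL-encoded. Here is a minimal sketch using only the Python standard library; `q=*:*` and `rows=0` are taken from the question's query, not from this answer. Note that the `+` in `+1DAY` must be sent as `%2B`, otherwise it is decoded as a space.

```python
from urllib.parse import parse_qs, urlencode

# Same facet parameters as above, plus q and rows from the question's query.
params = [
    ("q", "*:*"),
    ("rows", "0"),
    ("facet", "true"),
    ("facet.range", "active"),
    ("facet.range.start", "NOW/MONTH"),
    ("facet.range.end", "NOW/MONTH+1MONTH"),
    ("facet.range.include", "outer"),
    ("facet.range.gap", "+1DAY"),
]

query_string = urlencode(params)      # percent-encodes '+', '/', ':' and '*'
select_path = "/select?" + query_string
decoded = parse_qs(query_string)      # round trip, to check nothing was lost

print(select_path)
print(decoded["facet.range.gap"])
```

Since NOW/MONTH is relative to the current date, a query against the December 2017 sample data would need fixed timestamps for facet.range.start and facet.range.end instead.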

I've added facet.range.include=outer so that the response matches the format you asked for (the lower and upper bounds are not included). You can change this parameter if another setting suits you better.

You will get exactly what you need:

2017-12-01T00:00:00Z
1
2017-12-02T00:00:00Z
3
2017-12-03T00:00:00Z
3
2017-12-04T00:00:00Z
1
2017-12-05T00:00:00Z
1
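To turn that alternating list of bucket starts and counts into the structure from the question, something like the sketch below could be used on the client side. It assumes the counts arrive as a flat list of the form [date, count, date, count, ...], as in the output above; `pair_counts` is a hypothetical helper name.

```python
def pair_counts(flat):
    """Convert [date, count, date, count, ...] into a list of {Date, count} records."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of entries (date/count pairs)")
    return [
        # Keep only the YYYY-MM-DD part of the bucket start.
        {"Date": flat[i][:10], "count": flat[i + 1]}
        for i in range(0, len(flat), 2)
    ]


flat_counts = [
    "2017-12-01T00:00:00Z", 1,
    "2017-12-02T00:00:00Z", 3,
    "2017-12-03T00:00:00Z", 3,
    "2017-12-04T00:00:00Z", 1,
    "2017-12-05T00:00:00Z", 1,
]
records = pair_counts(flat_counts)
for record in records:
    print(record)
```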

Full example of the code is here.



Tags: date solr facet