How to get date_part query to hit index?

2019-08-01 04:39发布

问题:

I have yet to be able to get this query to hit an index instead of performing a full scan - I have another query that uses date_part('day', datelocal) against an almost identical table (that table just has a bit less data but same structure) and that one will hit the index I created on the datelocal column (which is a timestamp without timezone). Query (this one performs a parallel seq scan on the table and does a memory quicksort):

SELECT
    date_part('hour', datelocal) AS hour,
    SUM(CASE WHEN gender LIKE 'male' THEN views ELSE 0 END) AS male,
    SUM(CASE WHEN gender LIKE 'female' THEN views ELSE 0 END) AS female
FROM reportimpression
WHERE datelocal >= '2-1-2019' AND datelocal < '2-28-2019'
GROUP BY date_part('hour', datelocal)
ORDER BY date_part('hour', datelocal)

Here is the other one that does hit my datelocal index:

SELECT
    date_part('day', datelocal) AS day,
    SUM(CASE WHEN gender LIKE 'male' THEN views ELSE 0 END) AS male,
    SUM(CASE WHEN gender LIKE 'female' THEN views ELSE 0 END) AS female
FROM reportimpressionday
WHERE datelocal >= '2-1-2019' AND datelocal < '2-28-2019'
GROUP BY date_trunc('day', datelocal), date_part('day', datelocal)
ORDER BY date_trunc('day', datelocal)

Banging my head about this! Any ideas as to how I can speed up the first one or at least make it hit an index? I've tried creating an index on the datelocal field, a compound index on datelocal, gender, and views, and an expression index on date_part('hour', datelocal) but none of that has worked.

Schemas:

-- Table Definition ----------------------------------------------

CREATE TABLE reportimpression (
    datelocal timestamp without time zone,
    devicename text,
    network text,
    sitecode text,
    advertisername text,
    mediafilename text,
    gender text,
    agegroup text,
    views integer,
    impressions integer,
    dwelltime numeric
);

-- Indices -------------------------------------------------------

CREATE INDEX reportimpression_datelocal_index ON reportimpression(datelocal timestamp_ops);
CREATE INDEX reportimpression_viewership_index ON reportimpression(datelocal timestamp_ops,views int4_ops,impressions int4_ops,gender text_ops,agegroup text_ops);
CREATE INDEX reportimpression_test_index ON reportimpression(datelocal timestamp_ops,(date_part('hour'::text, datelocal)) float8_ops);
-- Table Definition ----------------------------------------------

CREATE TABLE reportimpressionday (
    datelocal timestamp without time zone,
    devicename text,
    network text,
    sitecode text,
    advertisername text,
    mediafilename text,
    gender text,
    agegroup text,
    views integer,
    impressions integer,
    dwelltime numeric
);

-- Indices -------------------------------------------------------

CREATE INDEX reportimpressionday_datelocal_index ON reportimpressionday(datelocal timestamp_ops);
CREATE INDEX reportimpressionday_detail_index ON reportimpressionday(datelocal timestamp_ops,views int4_ops,impressions int4_ops,gender text_ops,agegroup text_ops);

Explain (analyze, buffers) output:

Finalize GroupAggregate  (cost=999842.42..999859.67 rows=3137 width=24) (actual time=43754.700..43754.714 rows=24 loops=1)
  Group Key: (date_part('hour'::text, datelocal))
  Buffers: shared hit=123912 read=823290
  I/O Timings: read=81228.280
  ->  Sort  (cost=999842.42..999843.99 rows=3137 width=24) (actual time=43754.695..43754.698 rows=48 loops=1)
        Sort Key: (date_part('hour'::text, datelocal))
        Sort Method: quicksort  Memory: 28kB
        Buffers: shared hit=123912 read=823290
        I/O Timings: read=81228.280
        ->  Gather  (cost=999481.30..999805.98 rows=3137 width=24) (actual time=43754.520..43777.558 rows=48 loops=1)
              Workers Planned: 1
              Workers Launched: 1
              Buffers: shared hit=123912 read=823290
              I/O Timings: read=81228.280
              ->  Partial HashAggregate  (cost=998481.30..998492.28 rows=3137 width=24) (actual time=43751.649..43751.672 rows=24 loops=2)
                    Group Key: date_part('hour'::text, datelocal)
                    Buffers: shared hit=123912 read=823290
                    I/O Timings: read=81228.280
                    ->  Parallel Seq Scan on reportimpression  (cost=0.00..991555.98 rows=2770129 width=17) (actual time=13.097..42974.126 rows=2338145 loops=2)
                          Filter: ((datelocal >= '2019-02-01 00:00:00'::timestamp without time zone) AND (datelocal < '2019-02-28 00:00:00'::timestamp without time zone))
                          Rows Removed by Filter: 6792750
                          Buffers: shared hit=123912 read=823290
                          I/O Timings: read=81228.280
Planning time: 0.185 ms
Execution time: 43777.701 ms

回答1:

Well, both your queries are on different tables (reportimpression vs. reportimpressionday), so the comparison of the two queries really isn't a comparison. Did you ANALYZE both? Various column statistics also may play a role. Index or table bloat may be different. Does a larger part of all rows qualify for Feb 2019? Etc.

One shot in the dark, compare the percentages for both tables:

SELECT tbl, round(share * 100 / total, 2) As percentage
FROM  (
   SELECT text 'reportimpression' AS tbl
        , count(*)::numeric AS total
        , count(*) FILTER (WHERE datelocal >= '2019-02-01' AND datelocal < '2019-03-01')::numeric AS share
   FROM  reportimpression

   UNION ALL
   SELECT 'reportimpressionday'
        , count(*)
        , count(*) FILTER (WHERE datelocal >= '2019-02-01' AND datelocal < '2019-03-01')
   FROM  reportimpressionday
  ) sub;

Is the one for reportimpression bigger? Then it might just exceed the number for which an index is expected to help.

Generally, your index reportimpression_datelocal_index on (datelocal) looks good for it, and reportimpression_viewership_index even allows index-only scans if autovacuum beats the write load on the table. (Though impressions & agegroup are just dead freight for this and it would work even better without).

Answer

You got 26.6 percent, and day is 26.4 percent for my query. For such a large percentage, indexes are typically not useful at all. A sequential scan is typically the fastest way. Only index-only scans may still make sense if the underlying table is much bigger. (Or you have severe table bloat, and less bloated indexes, which makes indexes more attractive again.)

Your first query may just be across the tipping point. Try narrowing the time frame until you see index-only scans. You won't see (bitmap) index scans with more then roughly 5 % of all rows qualifying (depends on many factors).

Queries

Be that as it may, consider these modified queries:

SELECT date_part('hour', datelocal)                AS hour
     , SUM(views) FILTER (WHERE gender = 'male')   AS male
     , SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpression
WHERE  datelocal >= '2019-02-01'
AND    datelocal <  '2019-03-01' -- '2019-02-28'  -- ?
GROUP  BY 1
ORDER  BY 1;

SELECT date_trunc('day', datelocal)                AS day
     , SUM(views) FILTER (WHERE gender = 'male')   AS male
     , SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpressionday
WHERE  datelocal >= '2019-02-01'
AND    datelocal <  '2019-03-01'
GROUP  BY 1
ORDER  BY 1;

Major points

  • When using localized date format like '2-1-2019', go through to_timestamp() with explicit format specifiers. Else this depends on locale settings and might break (silently) when called from a session with different settings. Rather use ISO date / time formats as demonstrated which do not depend on locale settings.

  • Looks like you want to include the whole month of February. But your query misses out on the upper bound. For one, February may have 29 days. An datelocal < '2-28-2019' excludes all of Feb 28 as well. Use datelocal < '2019-03-01' instead.

  • It's cheaper to group & sort by the same expression as you have in the SELECT list if you can. So use date_trunc() there, too. Don't use different expressions without need. If you need the datepart in the result, apply it on the grouped expression, like:

    SELECT date_part('day', date_trunc('day', datelocal)) AS day
    ...
    GROUP  BY date_trunc('day', datelocal)
    ORDER  BY date_trunc('day', datelocal);
    

    A bit more noisy code, but faster (and possibly easier to optimize for the query planner, too).

  • Use the aggregate FILTER clause in Postgres 9.4 or later. It's cleaner and a bit faster. See:

    • How can I simplify this game statistics query?
    • For absolute performance, is SUM faster or COUNT?