I have a words table with an index on (language_id, state). Here are the results for EXPLAIN ANALYZE:
No limit
explain analyze SELECT "words".* FROM "words" WHERE (words.language_id = 27) AND (state IS NULL);
Bitmap Heap Scan on words (cost=10800.38..134324.10 rows=441257 width=96) (actual time=233.257..416.026 rows=540556 loops=1)
Recheck Cond: ((language_id = 27) AND (state IS NULL))
-> Bitmap Index Scan on ls (cost=0.00..10690.07 rows=441257 width=0) (actual time=230.849..230.849 rows=540556 loops=1)
Index Cond: ((language_id = 27) AND (state IS NULL))
Total runtime: 460.277 ms
(5 rows)
Limit 100
explain analyze SELECT "words".* FROM "words" WHERE (words.language_id = 27) AND (state IS NULL) LIMIT 100;
Limit (cost=0.00..51.66 rows=100 width=96) (actual time=0.081..0.184 rows=100 loops=1)
-> Seq Scan on words (cost=0.00..227935.59 rows=441257 width=96) (actual time=0.080..0.160 rows=100 loops=1)
Filter: ((state IS NULL) AND (language_id = 27))
Total runtime: 0.240 ms
(4 rows)
Why is this happening? How can I get the index to be used in all cases?
Thanks.
I think the PostgreSQL query planner simply decides that in the second case - the one with the LIMIT - applying the index isn't worth it because the LIMIT is so small. So it's not an issue.
Take a look at the PostgreSQL documentation about Using EXPLAIN and Query Planning. The reason the query planner prefers a sequential scan over an index scan in the LIMIT 100 case is simply that the sequential scan is cheaper.
There is no ORDER BY clause in the query, so the planner is fine with the first 100 (random) rows that match the filter condition. An index scan would have to read the index pages first and then the data pages to fetch the corresponding rows; the sequential scan only has to read the data pages. In your case the table statistics seem to suggest that enough rows match the filter condition, so the cost of sequentially reading pages until 100 matching rows are found is considered cheaper than reading the index first and then fetching the actual rows. You might see a different plan when you raise the limit or when fewer rows match the filter condition.
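For example (just a sketch of the kind of experiment meant above; the exact limit at which the plan flips depends on your statistics and cost settings):
-- Re-run with a much larger limit and compare the plans: at some point the
-- estimated cost of scanning sequentially should exceed that of the bitmap
-- index scan, and the planner should switch back to the index.
EXPLAIN ANALYZE SELECT "words".* FROM "words" WHERE (words.language_id = 27) AND (state IS NULL) LIMIT 100000;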
With the default settings the planner considers the cost of a random page read (random_page_cost) to be four times the cost of a sequential page read (seq_page_cost). These settings can be adjusted to tune query plans (e.g. when the whole database fits in RAM, a random page read is not more expensive than a sequential page read and an index scan should be preferred). You can also try out different query plans by enabling/disabling certain kinds of scans, e.g.:
set enable_seqscan = [on | off]
set enable_indexscan = [on | off]
While it is possible to enable/disable certain kinds of scans globally, these settings should only be used ad hoc, on a per-session basis, for debugging or troubleshooting.
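A minimal per-session sketch (the value 1.1 for random_page_cost is only illustrative; RESET restores the default afterwards):
-- Discourage the sequential scan for this session only:
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT "words".* FROM "words" WHERE (words.language_id = 27) AND (state IS NULL) LIMIT 100;
RESET enable_seqscan;
-- Or make random page reads look cheaper, e.g. for a database that fits in RAM:
SET random_page_cost = 1.1;
EXPLAIN ANALYZE SELECT "words".* FROM "words" WHERE (words.language_id = 27) AND (state IS NULL) LIMIT 100;
RESET random_page_cost;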
Also run VACUUM ANALYZE words before you test the query plans, otherwise an automatic vacuum (autovacuum) run between the tests might influence the results.
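If you want to check whether autovacuum has touched the table between two test runs, the statistics view pg_stat_user_tables can tell you (a quick sketch):
-- Shows when the table was last vacuumed/analyzed, manually or by autovacuum.
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'words';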
Without limit: rows=540556 loops=1 Total runtime: 460.277 ms
With limit: rows=100 loops=1 Total runtime: 0.240 ms
I don't see a problem here. If your query yields 500K rows, it will take more time.
It's also odd that the two queries report a different number of rows - I guess you have been inserting in the meantime, though... What if you do a sub-select?
select * from (select ...) t limit 100;
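Spelled out against the original query it would look something like this (note that PostgreSQL requires an alias for a subquery in FROM; whether it actually changes the chosen plan is something to verify with EXPLAIN ANALYZE):
SELECT * FROM (
  SELECT "words".* FROM "words" WHERE (words.language_id = 27) AND (state IS NULL)
) AS t
LIMIT 100;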