Searching on expression indexes

2020-04-14 16:54发布

问题:

Searching on expression indexes

I'm building an expression index that substrings a source field to avoid overflowing the 2172 character limit on B-trees:

CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON record_changes_log_detail 
    USING btree ((substring(old_value,1,1024)::text) text_pattern_ops);

For the record:

-- Postgres 11.4 on RDS, 11.5 on macOS at home.
-- The record_changes_log_detail table has about 8M in my test setup.
-- The old_value field is of type citext.
-- Values in the field range in length from 1 character to over 5,000. Most are short.

This search uses the index specified above:

select * from record_changes_log_detail 
where substring(old_value,1,1024) = 'Gold Kerrison Neuro';

This search does not use the index:

where old_value = 'Gold Kerrison Neuro';

I found this surprising, and a real bummer. If I understand correctly a comment from jjanes on another question, the planner only recognizes that the index applies when your query statement uses exactly the same expression. In other words, anyone writing a query needs to know the details of the index definition or else the index won't be used.

I had been assuming that when the expression index was constructed, the abbreviated/extracted/etc. value was stored and that the planner would check it. Is there any way to hint the planner other than to recapitulate the entire expression? The index has the right data, but the planner seems to skip over it.

I'm adding a bit of detail based on Erwin Brandstetter's answer:

I have lots of similar situations, which is why I'm digging in here on the details. In this case, of my ~8M rows, only 6 have values longer than 2172 characters, and 99.93% of the values are 100 characters or less.

What I'm hoping for is an approach that is easy for someone else to pick up. A shadow-field might well be the answer as having to know the exact details of index constructions strikes me as exactly the wrong sort of visibility. A shadow field doesn't suffer from that problem, once you know to use it. I could either populate it with LEFT(old_field,128) or some other length, or texthash(old_field), as you mentioned. I'll experiment with that. My data is so skewed to short values, that hashing seems to lead to a high collision rate.

For what it's worth, the team and I are coming from a system where text fields are silently trimmed to 1024 characters when indexed in a B-tree. It's completely transparent to the user and searches consult the index. Apples and spark plugs, I know. The point is that I'm not expecting Postgres to be an AI, but I am coming with inaccurate priors. So thanks to you and everyone else for helping me learn more about how Postgres actually works.


Follow-up

This question has been answered, but I want to add some follow-up for the archives. I've been learning a lot from old answers, some very old. So here's a bit of information for the future. I tried out four solutions:

  • B-tree on a portion of the citext field.
  • B-tree on a hash of the citext field.
  • Hash index of the citext field.
  • Tri-gram GIN index of the citext field.

Since there doesn't appear to be any way to get LIKE-type queries on citext where the text may be too long, the goal is to create an index for =. Any of the three above will work okay, but they differ quite a bit. Here's some setup code for the test:

DROP INDEX IF EXISTS record_changes_log_detail_old_value_ix_btree;
CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON record_changes_log_detail
    USING btree ((left(old_value,1024)::citext) citext_pattern_ops);


DROP INDEX IF EXISTS record_changes_log_detail_old_value_hash_ix_btree;
CREATE INDEX record_changes_log_detail_old_value_hash_ix_btree
    ON record_changes_log_detail
    USING btree (hashtext(old_value));

DROP INDEX IF EXISTS record_changes_log_detail_old_value_ix_hash;
CREATE INDEX record_changes_log_detail_old_value_ix_hash
    ON record_changes_log_detail
    USING hash (old_value);    

DROP INDEX IF EXISTS record_changes_log_detail_old_value_ix_tgrm;
CREATE INDEX record_changes_log_detail_old_value_ix_tgrm
    ON record_changes_log_detail 
    USING gin (old_value gin_trgm_ops);

VACUUM ANALYZE;

These indexes each work to find a record, but with different syntax:

-- Uses the LEFT()::citext index
explain analyze
select * from record_changes_log_detail 
where left(old_value,1024)::citext = 'Gold Kerrison Neuro';

-- Uses the HASH index
explain analyze
select * from record_changes_log_detail 
where old_value = 'Gold Kerrison Neuro';

-- Uses the HASHTEXT() index
explain analyze
select * from record_changes_log_detail 
where hashtext(old_value) = hashtext('Gold Kerrison Neuro');

-- Uses the tri-gram() index
explain analyze
select * from record_changes_log_detail 
where old_value::text LIKE '%Gold Kerrison Neuro%';

The hash index provides the nicest syntax because it's transparent...but the hash index is the worst in every other way. Here's a size search and results. I've added reported index building times manually here.

select
'B-tree on LEFT(old_value,1024)::citext' as index_description,
pg_size_pretty(pg_relation_size ('record_changes_log_detail_old_value_ix_btree')) as pretty

union all

select
'B-tree on HASHTEXT(old_value)' as index_description,
pg_size_pretty(pg_relation_size ('record_changes_log_detail_old_value_hash_ix_btree')) as pretty

union all

select
'Hash index on old_value' as index_description,
pg_size_pretty(pg_relation_size ('record_changes_log_detail_old_value_ix_hash')) as pretty

union all

select
'GIN tri-gram index on old_value' as index_description,
pg_size_pretty(pg_relation_size ('record_changes_log_detail_old_value_ix_tgrm')) as pretty;


index_description                       pretty  seconds
B-tree on LEFT(old_value,1024)::citext  238 MB       38
B-tree on HASHTEXT(old_value)           166 MB        7
Hash index on old_value                 362 MB    3,802
GIN tri-gram index on old_value         106 MB       56

I'd say this data is a terrible match for a hash index, so please don't take those results as typical. Still, the time and size are pretty bad. The clear winner for = searches is Erwin Brandstetter's clever suggestion to B-tree a hash. Nice! The extra syntactic sugar needed for the search isn't as bad here as for the LEFT-based index. Looking forward, this will benefit from the B-tree improvements promised in PG 12.

And more good news, the tri-gram index is awesome. Laurenz Albe suggested trying it out, and I'm happy I did. Instantaneous contains/like searches, perfect. That's just what I need. Here again, I doubt that the index size is typical...my data is weird. For those using citext, note that you have to cast the search condition to text for the index to be used:

select * from record_changes_log_detail 
where old_value::text LIKE '%Gold Kerrison Neuro%';

For those who don't know, tri-grams are an instance of n-grams with a length of 3. N-grams are sometimes called q-grams or k-grams. Whatever, it's the same thing. Of all of the naive (non-probabilistic or statistical) fuzzy text matching algorithms, it's probably the best. Robust over different data sets and languages, flexible, awesome. So I'm super pleased at how well it works in Postgres.

回答1:

It's just like you read from jjanes elsewhere: an expression index is only considered if the expression is matched in the query predicate exactly. The Postgres query planner is not an AI. It would quickly defeat the purpose of making queries fast if planning them takes too long.

You can optimize your index a bit, if that's any consolation. left() is simpler and faster than substring():

CREATE INDEX record_changes_log_detail_old_value_ix_btree
ON record_changes_log_detail (left(old_value,1024) text_pattern_ops);

Also, there is a maximum row size of 2704 bytes for btree indexes, not a "2172 character limit on B-trees".

Most importantly, for only equality checks, like your question suggests, a btree index on a hash value using md5(old_value) or hashtext(old_value) would be much more efficient. If you do, remember to defend against hash collisions like so:

SELECT *
FROM   record_changes_log_detail 
WHERE  hashtext(old_value) = hashtext('Gold Kerrison Neuro')
AND    old_value = 'Gold Kerrison Neuro';

The first predicate gives you fast index access. The second excludes false positives. Collisions should be extremely rare. But possible. And the possibility grows with the size of the table.

Related:

  • SELECT query with DISTINCT on a table-structure for graphs is very slow
  • What is the optimal data type for an MD5 field?
  • Full-text search in CouchDB

Or a hash index like you have been considering yourself already:

  • Why is a Postgres 11 hash index so large?

(Here you don't need to worry about hash collisions; handled internally.)