I have the following query in PostgreSQL (9.5.1):
select e.id, (select count(id) from imgitem ii where ii.tabid = e.id and ii.tab = 'esp') as imgs,
e.ano, e.mes, e.dia, cast(cast(e.ano as varchar(4))||'-'||right('0'||cast(e.mes as varchar(2)),2)||'-'|| right('0'||cast(e.dia as varchar(2)),2) as varchar(10)) as data,
pl.pltag, e.inpa, e.det, d.ano anodet, coalesce(p.abrev,'')||' ('||coalesce(p.prenome,'')||')' determinador, d.tax, coalesce(v.val,v.valf)||' '||vu.unit as altura,
coalesce(v1.val,v1.valf)||' '||vu1.unit as DAP, d.fam, tf.nome família, d.gen, tg.nome gênero, d.sp, ts.nome espécie, d.inf, e.loc, l.nome localidade, e.lat, e.lon
from esp e
left join det d on e.det = d.id
left join tax tf on d.fam = tf.oldfam
left join tax tg on d.gen = tg.oldgen
left join tax ts on d.sp = ts.oldsp
left join tax ti on d.inf = ti.oldinf
left join loc l on e.loc = l.id
left join pess p on p.id = d.detby
left join var v on v.esp = e.id and v.key = 265
left join varunit vu on vu.id = v.unit
left join var v1 on v1.esp = e.id and v1.key = 264
left join varunit vu1 on vu1.id = v1.unit
left join pl on pl.id = e.pl
WHERE unaccent(TEXT(coalesce(p.abrev,'')||' ('||coalesce(p.prenome,'')||')')) ilike unaccent('%vicen%')
It takes 430ms to retrieve 1129 rows from a total of 9250 in esp
table.
If I change the search term from %vicen%
to %vicent%
(adding a 't'), it takes 431ms to retrieve the same 1129 rows.
Ordering by the search column, ascending and descending, I see that all 1129 rows have exactly the same name in both cases.
Now the strange: if I change the search term from %vicent%
to %vicenti%
(adding an 'i'), now it takes unbelievable 24.4 seconds to retrieve the same 1129 rows!
The searched term is always in the first coalesce
, i.e. coalesce(p.abrev,'')
. I expect the query to run slower or faster, depending on the size of the searched string, but not that much!! Anyone has any idea of what's going on?
Results of EXPLAIN ANALYZE
(would exceed the 30k character limit here):
For %vicen%
: http://explain.depesz.com/s/2XF
For %vicenti%
: http://explain.depesz.com/s/dEc6
Why?
The reason is this:
Fast query:
-> Hash Left Join (cost=1378.60..2467.48 rows=15 width=79) (actual time=41.759..85.037 rows=1129 loops=1)
...
Filter: (unaccent(((((COALESCE(p.abrev, ''::character varying))::text || ' ('::text) || (COALESCE(p.prenome, ''::character varying))::text) || ')'::text)) ~~* (...)
Slow query:
-> Hash Left Join (cost=1378.60..2467.48 rows=1 width=79) (actual time=35.084..80.209 rows=1129 loops=1)
...
Filter: (unaccent(((((COALESCE(p.abrev, ''::character varying))::text || ' ('::text) || (COALESCE(p.prenome, ''::character varying))::text) || ')'::text)) ~~* unacc (...)
Extending the search pattern by another character causes Postgres to assume yet fewer hits. (Typically, this is a reasonable estimate.) Postgres obviously does not have precise enough statistics (none, actually, keep reading) to expect the same number of hits that you really get.
This causes a switch to a different query plan, which is even less optimal for the actual number of hits rows=1129
.
Solution
Assuming current Postgres 9.5 since it has not been declared.
One way to improve the situation is to create an expression index on the expression in the predicate. This makes Postgres gather statistics for the actual expression, which can help the query even if the index itself is not used for the query. Without the index, there are no statistics for the expression at all. And if done right the index can be used for the query, that's even much better. But there are multiple problems with your current expression:
unaccent(TEXT(coalesce(p.abrev,'')||' ('||coalesce(p.prenome,'')||')')) ilike unaccent('%vicen%')
Consider this updated query, based on some assumptions about your undisclosed table definitions:
SELECT e.id
, (SELECT count(*) FROM imgitem
WHERE tabid = e.id AND tab = 'esp') AS imgs -- count(*) is faster
, e.ano, e.mes, e.dia
, e.ano::text || to_char(e.mes2, 'FM"-"00')
|| to_char(e.dia, 'FM"-"00') AS data
, pl.pltag, e.inpa, e.det, d.ano anodet
, format('%s (%s)', p.abrev, p.prenome) AS determinador
, d.tax
, coalesce(v.val,v.valf) || ' ' || vu.unit AS altura
, coalesce(v1.val,v1.valf) || ' ' || vu1.unit AS dap
, d.fam, tf.nome família, d.gen, tg.nome AS gênero, d.sp
, ts.nome AS espécie, d.inf, e.loc, l.nome localidade, e.lat, e.lon
FROM pess p -- reorder!
JOIN det d ON d.detby = p.id -- INNER JOIN !
LEFT JOIN tax tf ON tf.oldfam = d.fam
LEFT JOIN tax tg ON tg.oldgen = d.gen
LEFT JOIN tax ts ON ts.oldsp = d.sp
LEFT JOIN tax ti ON ti.oldinf = d.inf -- unused, see @joop's comment
LEFT JOIN esp e ON e.det = d.id
LEFT JOIN loc l ON l.id = e.loc
LEFT JOIN var v ON v.esp = e.id AND v.key = 265
LEFT JOIN varunit vu ON vu.id = v.unit
LEFT JOIN var v1 ON v1.esp = e.id AND v1.key = 264
LEFT JOIN varunit vu1 ON vu1.id = v1.unit
LEFT JOIN pl ON pl.id = e.pl
WHERE f_unaccent(p.abrev) ILIKE f_unaccent('%' || 'vicenti' || '%') OR
f_unaccent(p.prenome) ILIKE f_unaccent('%' || 'vicenti' || '%');
Major points
Why f_unaccent()
? Because unaccent()
can't be indexed. Read this:
- Does PostgreSQL support "accent insensitive" collations?
I used the function outlined there to allow the following (recommended!) multicolumn functional trigram GIN index:
CREATE INDEX pess_unaccent_nome_trgm_idx ON pess
USING gin (f_unaccent(pess) gin_trgm_ops, f_unaccent(prenome) gin_trgm_ops);
If you are not familiar with trigram indexes, read this first:
- PostgreSQL LIKE query performance variations
And possibly:
- Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
Be sure to run the latest version of Postgres (currently 9.5). There have been substantial improvements to GIN indexes. And you'll be interested in improvements in pg_trgm 1.2, scheduled to be released with the upcoming Postgres 9.6:
- Trigram search gets much slower as search string gets longer
Prepared statements are a common way to execute queries with parameters (especially with text from user input). Postgres has to find a plan that works best for any given parameter. Add wildcards as constants to the to the search term like this:
f_unaccent(p.abrev) ILIKE f_unaccent('%' || 'vicenti' || '%')
('vicenti'
would be replaced with a parameter.) So Postgres knows we are dealing with a pattern that is neither anchored left nor right - which would allow different strategies. Related answer with more details:
- Performance impact of empty LIKE in a prepared statement
Or maybe re-plan the query for every search term (possibly using dynamic SQL in a function). But make sure planning time isn't eating any possible performance gain.
The WHERE
condition on columns in pess
contradicts the LEFT JOIN
. Postgres is forced to convert that to an INNER JOIN
. What's worse the join comes late in the join tree. And since Postgres cannot reorder your joins (see below), that can become very expensive. Move the table to the first position in the FROM
clause to eliminate rows early. Following LEFT JOIN
s do not eliminate any rows by definition. But with that many tables it is important to move joins that might multiply rows to the end.
You are joining 13 tables, 12 of them with LEFT JOIN
which leaves 12!
possible combinations - or 11! * 2!
if we take the one LEFT JOIN
into account that's really an INNER JOIN
. That's too many for Postgres to evaluate all possible permutations for the best query plan. Read about join_collapse_limit
:
- Sample Query to show Cardinality estimation error in PostgreSQL
- SQL INNER JOIN over multiple tables equal to WHERE syntax
The default setting for join_collapse_limit
is 8, which means that Postgres won't try to reorder tables in your FROM
clause and the order of tables is relevant.
One way work around this would be to split the performance-critical part into a CTE like @joop commented. Don't set join_collapse_limit
much higher or times for query planning involving many joined tables will deteriorate.
About your concatenated date named data
:
cast(cast(e.ano as varchar(4))||'-'||right('0'||cast(e.mes as varchar(2)),2)||'-'|| right('0'||cast(e.dia as varchar(2)),2) as varchar(10)) as data
Assuming you build from three numeric columns for year, month and day, which are defined NOT NULL
, use this instead:
e.ano::text || to_char(e.mes2, 'FM"-"00')
|| to_char(e.dia, 'FM"-"00') AS data
About the FM
template pattern modifier:
- Check for integer in string array
But really, you should store the date as data type date
to begin with.
Also simplified:
format('%s (%s)', p.abrev, p.prenome) AS determinador
Won't make the query faster, but it's much cleaner. See format()
.
First things last, all the usual advice for performance optimization applies:
- Keep PostgreSQL from sometimes choosing a bad query plan
If you get all of this right, you should see much faster queries for all patterns.
A way to reduce the size of the range table is to squeeze out a trivial part of the query into a CTE, such AS:
WITH zzz AS (
SELECT l.id, l.nome
, coalesce(v.val,v.valf)||' '||vu.unit as altura
, coalesce(v1.val,v1.valf)||' '||vu1.unit as DAP
FROM loc l
left join var v on v.esp = l.id and v.key = 265
left join varunit vu on vu.id = v.unit
left join var v1 on v1.esp = l.id and v1.key = 264
left join varunit vu1 on vu1.id = v1.unit
)
select e.id, (select count(id) from imgitem ii
where ii.tabid = e.id and ii.tab = 'esp'
) as imgs
, e.ano, e.mes, e.dia
, cast(cast(e.ano as varchar(4))||'-'||right('0'||cast(e.mes as varchar(2)),2)||'-'|| right('0'||cast(e.dia as varchar(2)),2) as varchar(10)) as data
, pl.pltag, e.inpa, e.det, d.ano anodet
, coalesce(p.abrev,'')||' ('||coalesce(p.prenome,'')||')' determinador
, d.tax
, zzz.altura as altura
, zzz.DAP as DAP
, d.fam, tf.nome família
, d.gen, tg.nome gênero
, d.sp , ts.nome espécie
, d.inf, e.loc
, zzz.nome AS localidade
, e.lat, e.lon
from esp e
left join det d on e.det = d.id -- these could possibly be
left join pess p on p.id = d.detby -- plain joins
--
left join tax tf on d.fam = tf.oldfam
left join tax tg on d.gen = tg.oldgen
left join tax ts on d.sp = ts.oldsp
-- ### commented out, since it is never referred
-- ### left join tax ti on d.inf = ti.oldinf
left join pl on pl.id = e.pl
left JOIN zzz ON zzz.id = e.loc
--
WHERE unaccent(TEXT(coalesce(p.abrev,'')||' ('||coalesce(p.prenome,'')||')')) ilike unaccent('%vicen%')
;
[untested, since I don't have the table definitions]