Postgres function much slower when using input var

2020-03-06 07:07发布

问题:

I have a function in Postgres 8.3.5 that selects data from multiple tables and dumps the result in a single table:

create or replace function test_function_2(startdate timestamp, enddate timestamp)
returns void as $$
begin
     delete from cl_final_report;

     INSERT INTO cl_final_report
     SELECT 
         b.batchkey AS batchnumber, 
         pv.productkey, 
         p.name AS productname, 
         avg(r.value) AS avgchemlean, 
         sum(r.auxvalue) AS totalweight, 
         max(o.time) AS timecompleted
     FROM result r
     LEFT JOIN physicalvalue pv ON r.physicalvaluekey = pv.physicalvaluekey
     LEFT JOIN product p ON pv.productkey = p.productkey
     LEFT JOIN object o ON r.objectkey = o.objectkey
     LEFT JOIN batch b ON o.batchkey = b.batchkey
     WHERE pv.name = 'CL'::text AND
         and o.time between startdate and enddate
     GROUP BY b.batchkey, pv.productkey, p.name
end
$$ language plpgsql;

This function takes 113 seconds to complete using PgAdmin and executing this command:

select test_function_2('05/02/2013', '05/03/2013')

However, if I replace the input variables in the function with literals like this:

create or replace function test_function_2(startdate timestamp, enddate timestamp)
returns void as $$
begin
     delete from cl_final_report;

     INSERT INTO cl_final_report
     SELECT 
         b.batchkey AS batchnumber, 
         pv.productkey, 
         p.name AS productname, 
         avg(r.value) AS avgchemlean, 
         sum(r.auxvalue) AS totalweight, 
         max(o.time) AS timecompleted
     FROM result r
     LEFT JOIN physicalvalue pv ON r.physicalvaluekey = pv.physicalvaluekey
     LEFT JOIN product p ON pv.productkey = p.productkey
     LEFT JOIN object o ON r.objectkey = o.objectkey
     LEFT JOIN batch b ON o.batchkey = b.batchkey
     WHERE pv.name = 'CL'::text AND
         and o.time between '05/02/2013' and '05/03/2013'
     GROUP BY b.batchkey, pv.productkey, p.name
end
$$ language plpgsql;

The function executes in less than 5 seconds.

I'm new to Postgres so there's probably something I'm missing, but I can't seem to find an answer anywhere.

回答1:

You can force a new plan on each execution if you make the query dynamic:

create or replace function test_function_2(
    startdate timestamp, enddate timestamp
) returns void as $function$
begin
    delete from cl_final_report;

    execute $$
        INSERT INTO cl_final_report
        SELECT 
            b.batchkey AS batchnumber, 
            pv.productkey, 
            p.name AS productname, 
            avg(r.value) AS avgchemlean, 
            sum(r.auxvalue) AS totalweight, 
            max(o.time) AS timecompleted
        FROM result r
        LEFT JOIN physicalvalue pv ON r.physicalvaluekey = pv.physicalvaluekey
        LEFT JOIN product p ON pv.productkey = p.productkey
        LEFT JOIN object o ON r.objectkey = o.objectkey
        LEFT JOIN batch b ON o.batchkey = b.batchkey
        WHERE pv.name = 'CL'::text AND
            and o.time between $1 and $2
        GROUP BY b.batchkey, pv.productkey, p.name
    $$ using startdate, enddate;
end;
$function$ language plpgsql;

For it to work in 8.3 without using do string concatenation:

create or replace function test_function_2(
    startdate timestamp, enddate timestamp
) returns void as $function$
begin
    delete from cl_final_report;

    execute $$
        INSERT INTO cl_final_report
        SELECT 
            b.batchkey AS batchnumber, 
            pv.productkey, 
            p.name AS productname, 
            avg(r.value) AS avgchemlean, 
            sum(r.auxvalue) AS totalweight, 
            max(o.time) AS timecompleted
        FROM result r
        LEFT JOIN physicalvalue pv ON r.physicalvaluekey = pv.physicalvaluekey
        LEFT JOIN product p ON pv.productkey = p.productkey
        LEFT JOIN object o ON r.objectkey = o.objectkey
        LEFT JOIN batch b ON o.batchkey = b.batchkey
        WHERE pv.name = 'CL'::text AND
            and o.time between '$$ || startdate || $$' and '$$ || enddate || $$'
        GROUP BY b.batchkey, pv.productkey, p.name
    $$;
end;
$function$ language plpgsql;


回答2:

@A.H's explanation is accurate for PostgreSQL 9.1 or older. So it is applicable for the OP, who is using the outdated version 8.3.

However, PostgreSQL 9.2 brought a substantial update in this area. PL/pgSQL functions have become a lot smarter about when to replan. I am quoting the release notes for 9.2 here

E.5.3.1.3. Optimizer

Allow the planner to generate custom plans for specific parameter values even when using prepared statements (Tom Lane)

In the past, a prepared statement always had a single "generic" plan that was used for all parameter values, which was frequently much inferior to the plans used for non-prepared statements containing explicit constant values. Now, the planner attempts to generate custom plans for specific parameter values. A generic plan will only be used after custom plans have repeatedly proven to provide no benefit. This change should eliminate the performance penalties formerly seen from use of prepared statements (including non-dynamic statements in PL/pgSQL).

Bold emphasis mine.

Ergo: One solution for the OP would be to upgrade to PostgreSQL 9.2+ and everything should just work fine automatically.



回答3:

The query planner / optimizer can calculate a better plan when it has constants at hand.

When no constants are used the planner must generate a plan which is acceptable for all possible values of startdate and enddate. If the difference between these two values is very large, then a large part of the table must be fetched. In this case indizes are not used in most cases, because random access costs are higher than linear reads.

But when there are constants, then the planner can calculate based on gathered statistics, that the query will only touch a tiny fraction of the table and hence an index might be faster.

This is a common irk with the PostgreSQL query planner. The manual contains some hints in the PREPARE section (PREPARE is used internally by pl/pgsql):

In some situations, the query plan produced for a prepared statement will be inferior to the query plan that would have been chosen if the statement had been submitted and executed normally. This is because when the statement is planned and the planner attempts to determine the optimal query plan, the actual values of any parameters specified in the statement are unavailable. PostgreSQL collects statistics on the distribution of data in the table, and can use constant values in a statement to make guesses about the likely result of executing the statement. Since this data is unavailable when planning prepared statements with parameters, the chosen plan might be suboptimal. To examine the query plan PostgreSQL has chosen for a prepared statement, use EXPLAIN.