Re: BUG #17540: Prepared statement: PG switches to a generic query plan which is consistently much slower

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, william(dot)duclot(at)gmail(dot)com, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17540: Prepared statement: PG switches to a generic query plan which is consistently much slower
Date: 2022-07-07 18:02:24
Message-ID: 3011319.1657216944@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2022-07-06 23:13:18 -0400, Tom Lane wrote:
>> comparing the cost estimate for the generic plan to the cost estimate
>> for a custom plan is not really logically valid, because those estimates
>> are founded on different statistics. I don't know how to fix that :-(.

> I think there's something more fundamentally wrong - somehow we end up with
> assuming > 50% selectivity on both the min and the max initplan, for the same
> condition!

Well, sure, because it *is* the same condition. AFAICS this is operating
as designed. Do I wish it were better? Sure, but there is no simple fix
here.

The reasoning that's being applied in the generic plan is as follows (a sketch in SQL appears after the list):

(1) default selectivity estimate for a scalar inequality is
#define DEFAULT_INEQ_SEL 0.3333333333333333

(2) therefore, the filter condition on the indexscan will select a random
one-third of the table;

(3) therefore, the LIMIT will be able to stop after about three rows,
whichever direction we scan in.

The information that is lacking is that the "id" and "timestamp"
columns are heavily correlated, so that we may have to scan far more
than three rows in "id" order before finding a row satisfying the
inequality on "timestamp". This is a problem we've understood for
a long time --- I recall talking about it at PGCon a decade ago.
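A self-contained way to reproduce the effect (the dataset is invented
for illustration; any table where both columns increase together will do):

    CREATE TABLE t (id bigserial PRIMARY KEY, "timestamp" timestamptz);
    INSERT INTO t ("timestamp")
      SELECT now() - interval '1 second' * (1000000 - g)
      FROM generate_series(1, 1000000) AS g;
    ANALYZE t;

    -- id and "timestamp" rise together, so for
    --   SELECT min(id) FROM t WHERE "timestamp" > now() - interval '60 seconds';
    -- an ascending scan of t_pkey steps over roughly 999940 non-matching
    -- rows before reaching the first qualifying one, not the ~3 rows the
    -- default estimate expects.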

The extended stats machinery provides a framework wherein we could
calculate and save the ordering correlation between the two columns,
but I don't believe it actually calculates that number yet --- I think
the functional-dependency stuff is close but not the right thing.
Even if we had the stats, it's not very clear where to fit this
type of consideration into the planner's estimates.
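For reference, the extended-stats syntax we have today (shown against the
toy table above) collects functional dependencies but nothing about ordering:

    CREATE STATISTICS t_dep (dependencies) ON id, "timestamp" FROM t;
    ANALYZE t;
    -- pg_stats_ext now reports a dependency degree between the columns,
    -- but no statistic captures "rows late in id order are also late in
    -- timestamp order", which is what the LIMIT costing would need here.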

> In this case the whole generic plan part seems like a red herring. The generic
> plan is *awful* and would still be awful if the value were known, but
> somewhere around the middle of the value range.

If the value were somewhere around the middle (which is more or less
what we're assuming for the generic plan), then an indexscan on the
timestamp column isn't going to be that great either; you'd still
be scanning half the table.
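Assuming an index on "timestamp" exists on the toy table above, the point
is easy to check:

    CREATE INDEX ON t ("timestamp");
    EXPLAIN SELECT count(*) FROM t
      WHERE "timestamp" > now() - interval '500000 seconds';
    -- Expect an estimate near rows=500000: about half the table
    -- qualifies, so no access path is cheap for a mid-range parameter.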

> FWIW, manually writing the min/max as ORDER BY timestamp ASC/DESC LIMIT 1
> queries yields a *vastly* better plan:

Those queries give the wrong answers. We're looking for the min or max
id, not the id associated with the min or max timestamp. (They're
accidentally the same with this toy dataset.)
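To spell out the difference on the toy table (the rewritten query shape
is my reconstruction of what was described):

    -- Returns the id of the row with the largest "timestamp", which is
    -- not necessarily max(id):
    SELECT id FROM t WHERE "timestamp" > now() - interval '60 seconds'
      ORDER BY "timestamp" DESC LIMIT 1;

    -- The rewrite that actually matches max(id) must order by id:
    SELECT id FROM t WHERE "timestamp" > now() - interval '60 seconds'
      ORDER BY id DESC LIMIT 1;

    -- On this dataset the two coincide because id and "timestamp" share
    -- the same ordering; on real data they can differ.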

regards, tom lane
