Re: Use extended statistics to estimate (Var op Var) clauses

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Use extended statistics to estimate (Var op Var) clauses
Date: 2021-06-14 16:28:01
Message-ID: 9ef36aff-ca24-92b7-9f03-aa7ecfab36ee@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 6/14/21 5:36 PM, Mark Dilger wrote:
>
>
>> On Jun 13, 2021, at 1:28 PM, Tomas Vondra
>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> Here is a slightly updated version of the patch
>
> Thanks for taking this up again!
>
> Applying the new test cases from your patch, multiple estimates have
> gotten better. That looks good. I wrote a few extra test cases and
> saw no change, which is fine. I was looking for regressions where
> the estimates are now worse than before. Do you expect there to be
> any such cases?
>

Not really. These clauses could not be estimated before, we generally
used default estimates for them. So

WHERE a = b

would use 0.5%, while

WHERE a < b

would use 33%, and so on. OTOH it depends on the accuracy of the
extended statistics - particularly the MCV list (what fraction of the
data it covers, etc.).

So it's possible the default estimate is very accurate by chance, and
MCV list represents only a tiny fraction of the data. Then the new
estimate could we worse. Consider for example this:

create table t (a int, b int);
insert into t select 100, 100 from generate_series(1,5000) s(i);
insert into t select i, i+1 from generate_series(1,995000) s(i);

This has exactly 0.5% of rows with (a=b). Without extended stats it's
perfect:

explain analyze select * from t where a = b;

Seq Scan on t (cost=0.00..16925.00 rows=5000 width=8)
(actual time=0.064..159.928 rows=5000 loops=1)

while with statistics it gets worse:

create statistics s (mcv) on a, b from t;
analyze t;

Seq Scan on t (cost=0.00..16925.00 rows=9810 width=8)
(actual time=0.059..160.467 rows=5000 loops=1)

It's not terrible, although we could construct worse examples. But the
same issue applies to other clauses, not just to these new ones. And it
relies on the regular estimation producing better estimate by chance.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2021-06-14 16:33:18 Re: recent failures on lorikeet
Previous Message Tomas Vondra 2021-06-14 16:10:28 Re: a path towards replacing GEQO with something better