Re: eqjoinsel_semi still sucks ...

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru>
Cc: Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: eqjoinsel_semi still sucks ...
Date: 2023-06-28 16:27:59
Message-ID: 1235050.1687969679@sss.pgh.pa.us
Lists: pgsql-hackers

Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru> writes:
> On 23.06.2023 14:28, Andrey Lepikhov wrote:
>> On 2/5/2012 20:34, Tom Lane wrote:
>>> On reflection I think that the idea of clamping ndistinct beforehand is
>>> just wrong, and what we ought to do instead is apply a multiplier to the
>>> selectivity estimate afterwards.

>> I got stuck on some cases where (due to a tree of filters) the planner
>> underestimates the JOIN just because ndistinct feeds a huge number into
>> the selectivity estimation formula, even though the estimates for both
>> input relations themselves are correct and properly limited.
>> I've tried to understand the logic through commits 0d3b231eebf,
>> 97930cf578e and 7f3eba30c9d, but it is still not clear to me.
>> So, why is the idea of clamping ndistinct terrible in general? Could
>> you explain your reasons a bit more?

> Hi, I also have an interest in understanding this issue, and I dug into
> the above commits and the topic. I hope this letter will help to sort
> it out.

Note that nothing's actually been done about the idea I expressed at the
start of this thread. The behavior is better now, though: if you try the
example in v12 or later you get

regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.unique2 and b.two = 0);
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=532.50..1059.38 rows=5000 width=244) (actual time=1.993..3.959 rows=5090 loops=1)
   Hash Cond: (a.thousand = b.unique2)
   ->  Seq Scan on tenk1 a  (cost=0.00..445.00 rows=10000 width=244) (actual time=0.003..0.502 rows=10000 loops=1)
   ->  Hash  (cost=470.00..470.00 rows=5000 width=4) (actual time=1.960..1.960 rows=5000 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 240kB
         ->  Seq Scan on tenk1 b  (cost=0.00..470.00 rows=5000 width=4) (actual time=0.003..1.449 rows=5000 loops=1)
               Filter: (two = 0)
               Rows Removed by Filter: 5000
 Planning Time: 0.760 ms
 Execution Time: 4.087 ms
(10 rows)

I believe this improvement just stems from commit a314c3407 ("Clamp
semijoin selectivity to be not more than inner-join selectivity"),
and so this example looks good only because the actual semijoin
output size happens to be the same as the inner-join output size.
With a slight adjustment, that's no longer true and the estimate
still sucks:

regression=# explain analyze select * from tenk1 a, tenk1 b where a.thousand = b.fivethous and b.two = 0;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=532.50..1127.50 rows=10000 width=488) (actual time=2.000..5.742 rows=10000 loops=1)
...

regression=# explain analyze select * from tenk1 a where exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0);
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=532.50..1127.50 rows=10000 width=244) (actual time=1.529..3.474 rows=5000 loops=1)
...

regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0);
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=532.50..1027.50 rows=1 width=244) (actual time=1.564..3.476 rows=5000 loops=1)
...
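
For anyone trying to follow the arithmetic, here's a rough sketch in C of
how the no-MCV fallback in eqjoinsel_semi plus the a314c3407 clamp arrives
at those numbers. It's a paraphrase from memory, not the actual selfuncs.c
code; the function below is purely illustrative, and nulls and MCV stats
are ignored:

#include <stdio.h>

/*
 * Rough sketch of the no-MCV semijoin selectivity fallback (nulls ignored).
 * nd1 = ndistinct of the outer join key; nd2 = ndistinct of the inner join
 * key, already clamped to the inner rel's filtered row estimate.
 */
static double
semi_fraction(double nd1, double nd2, double inner_rows, double sel_inner)
{
	/* fraction of outer rows assumed to have at least one inner match */
	double		ssemi = (nd1 <= nd2) ? 1.0 : nd2 / nd1;

	/*
	 * a314c3407: a semijoin can't pass more rows than the corresponding
	 * inner join, i.e. N1 * Ssemi <= N1 * N2 * Sinner, so clamp Ssemi
	 * to N2 * Sinner.
	 */
	double		limit = inner_rows * sel_inner;

	return (ssemi > limit) ? limit : ssemi;
}

int
main(void)
{
	/* unique2 case: nd1 = 1000, nd2 = Min(10000, 5000), Sinner = 1/10000 */
	double		f1 = semi_fraction(1000, 5000, 5000, 1.0 / 10000);

	/* fivethous case: nd1 = 1000, nd2 = Min(5000, 5000), Sinner = 1/5000 */
	double		f2 = semi_fraction(1000, 5000, 5000, 1.0 / 5000);

	printf("unique2:   semi = %.0f, anti = %.0f\n", 10000 * f1, 10000 * (1 - f1));
	printf("fivethous: semi = %.0f, anti = %.0f\n", 10000 * f2, 10000 * (1 - f2));
	return 0;
}

That prints semi = 5000 / anti = 5000 for the unique2 case (the N2 * Sinner
limit kicks in) and semi = 10000 / anti = 0 for the fivethous case (the
planner then clamps the zero up to one row), which is exactly the
all-or-nothing behavior shown above.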

So we still have an issue to fix. I've not had time to pursue
the idea I expressed at the start of the thread. Also, it's a
bit scary to change this logic with only a few examples to look
at. If you could reduce the problem cases you're looking at
to something sharable, that could help move things along.
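
To make the "apply a multiplier afterwards" notion slightly more concrete,
here is a minimal sketch. The particular multiplier used below (just the
inner rel's restriction selectivity, standing in for the fraction of inner
distinct values that survive filtering) is only a placeholder, and coming
up with a defensible formula for it is precisely the unfinished part;
semi_fraction_alt and surviving_frac are names invented for illustration:

#include <stdio.h>

/*
 * Hypothetical sketch of "apply a multiplier afterwards": compute the
 * semijoin fraction from the raw (unclamped) inner ndistinct, then
 * discount it for the inner rel's restrictions.  surviving_frac is a
 * placeholder; a real implementation would need a principled estimate
 * of how many inner distinct values remain after filtering.
 */
static double
semi_fraction_alt(double nd1, double nd2_raw, double surviving_frac)
{
	double		ssemi = (nd1 <= nd2_raw) ? 1.0 : nd2_raw / nd1;

	return ssemi * surviving_frac;
}

int
main(void)
{
	/* fivethous case: nd1 = 1000, raw nd2 = 5000, "two = 0" keeps ~half */
	double		f = semi_fraction_alt(1000, 5000, 0.5);

	printf("fivethous: semi = %.0f, anti = %.0f\n", 10000 * f, 10000 * (1 - f));
	return 0;
}

With that placeholder the fivethous case comes out at semi = 5000 /
anti = 5000, which matches the actual row counts above; but that's as much
luck as anything, which is why more real-world problem cases would help.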

regards, tom lane
