From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru>
Cc: Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: eqjoinsel_semi still sucks ...
Date: 2023-06-28 16:27:59
Message-ID: 1235050.1687969679@sss.pgh.pa.us
Lists: pgsql-hackers
Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru> writes:
> On 23.06.2023 14:28, Andrey Lepikhov wrote:
>> On 2/5/2012 20:34, Tom Lane wrote:
>>> On reflection I think that the idea of clamping ndistinct beforehand is
>>> just wrong, and what we ought to do instead is apply a multiplier to the
>>> selectivity estimate afterwards.
>> I got stuck in some cases where (due to a tree of filters) the planner
>> underestimates the JOIN just because the ndistinct conveys a huge
>> number to the selectivity estimation formula. However, the estimation
>> of both input relations is made correctly and is limited.
>> I've tried to understand the logic through commits 0d3b231eebf,
>> 97930cf578e and 7f3eba30c9d. But it is still not clear.
>> So, why is the idea of clamping ndistinct terrible in general? Could
>> you explain your reasons a bit more?
> Hi, I am also interested in understanding this issue, and I dug
> into the above commits and the topic. I hope this message helps to
> sort out the issue.
Note that nothing's actually been done about the idea I expressed at the
start of this thread. The behavior is better now: if you try the example
in v12 or later you get:
regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.unique2 and b.two = 0);
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=532.50..1059.38 rows=5000 width=244) (actual time=1.993..3.959 rows=5090 loops=1)
   Hash Cond: (a.thousand = b.unique2)
   ->  Seq Scan on tenk1 a  (cost=0.00..445.00 rows=10000 width=244) (actual time=0.003..0.502 rows=10000 loops=1)
   ->  Hash  (cost=470.00..470.00 rows=5000 width=4) (actual time=1.960..1.960 rows=5000 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 240kB
         ->  Seq Scan on tenk1 b  (cost=0.00..470.00 rows=5000 width=4) (actual time=0.003..1.449 rows=5000 loops=1)
               Filter: (two = 0)
               Rows Removed by Filter: 5000
 Planning Time: 0.760 ms
 Execution Time: 4.087 ms
(10 rows)
I believe this improvement just stems from commit a314c3407 ("Clamp
semijoin selectivity to be not more than inner-join selectivity"),
and so this example looks good only because the actual semijoin
output size happens to be the same as the inner-join output size.
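To see why the clamp rescues the first example, here is a back-of-the-envelope sketch in Python. This is not the actual eqjoinsel_semi code; the distinct counts and row counts are assumptions read off the plan above (a.thousand has 1000 distinct values, b.unique2 is unique over 10000 rows, and the b.two = 0 filter keeps 5000 inner rows).

```python
# Illustrative arithmetic only -- not the planner's source code.
nd1 = 1000           # assumed distinct values of a.thousand
nd2 = 10000          # assumed distinct values of b.unique2 (unfiltered)
outer_rows = 10000
inner_rows = 5000    # after the b.two = 0 filter

# No-MCV semijoin branch: with nd2 clamped to the inner row count,
# nd1 <= nd2 makes the planner assume every outer row finds a match.
semi_sel = 1.0 if nd1 <= min(nd2, inner_rows) else min(nd2, inner_rows) / nd1

# Inner-join estimate: selectivity 1/max(nd1, nd2) over the cross product.
inner_join_rows = outer_rows * inner_rows / max(nd1, nd2)   # 5000

# The a314c3407 clamp: a semijoin can't emit more rows than the inner join.
semi_rows = min(semi_sel * outer_rows, inner_join_rows)     # 5000
anti_rows = outer_rows - semi_rows                          # 5000

print(semi_rows, anti_rows)
```

The naive semijoin estimate (all 10000 outer rows match) is replaced by the inner-join bound of 5000, which happens to be the right answer here.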
With a slight adjustment, that's no longer true and the estimate
still sucks:
regression=# explain analyze select * from tenk1 a, tenk1 b where a.thousand = b.fivethous and b.two = 0;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=532.50..1127.50 rows=10000 width=488) (actual time=2.000..5.742 rows=10000 loops=1)
 ...
regression=# explain analyze select * from tenk1 a where exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0);
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=532.50..1127.50 rows=10000 width=244) (actual time=1.529..3.474 rows=5000 loops=1)
 ...
regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0);
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=532.50..1027.50 rows=1 width=244) (actual time=1.564..3.476 rows=5000 loops=1)
 ...
So we still have an issue to fix. I've not had time to pursue
the idea I expressed at the start of the thread. Also, it's a
bit scary to change this logic with only a few examples to look
at. If you could reduce the problem cases you're looking at
to something sharable, that could help move things along.
regards, tom lane