Re: A new strategy for pull-up correlated ANY_SUBLINK

From: Alena Rybakina Andy Fan pgsql-hackers , Tom Lane , vignesh C , Richard Guo , Andrey Lepikhov Re: A new strategy for pull-up correlated ANY_SUBLINK 2023-10-11 21:01:37 b9437269-0480-4016-91bf-61168f72bdd3@yandex.ru Raw Message | Whole Thread | Download mbox | Resend email 2022-11-02 03:02:58 from Andy Fan 📎  2022-11-02 03:42:28 from Andrey Lepikhov   2022-11-02 05:34:54 from Andy Fan  2022-11-12 22:45:43 from Tom Lane 📎   2022-11-15 01:02:13 from Richard Guo   2023-01-06 06:16:23 from vignesh C    2023-01-31 17:47:31 from vignesh C     2023-02-14 02:11:23 from Andy Fan   2023-04-05 07:15:42 from Andy Fan 📎    2023-10-11 21:01:37 from Alena Rybakina 📎     2023-10-12 07:52:07 from Andy Fan      2023-10-12 23:14:13 from Alena Rybakina 📎       2023-10-13 07:04:45 from Andy Fan        2023-10-13 07:29:25 from Andy Fan 📎        2023-10-13 08:39:41 from Alena Rybakina 📎         2024-01-26 12:46:02 from vignesh C          2024-02-13 10:50:40 from Alexander Korotkov 📎           2024-02-15 09:51:28 from Andy Fan      2024-07-01 09:17:50 from Andrei Lepikhov       2024-07-03 08:33:44 from Andrei Lepikhov pgsql-hackers

Hi!

I reviewed your patch and it was interesting for me!

Thank you for the explanation. It was really informative for me!

>
> I think we need the restriction and that should be enough for this feature
> . Given the query Richard provided before:
>
> explain
> select * from tenk1 A where exists
> (select 1 from tenk2 B
> where A.hundred in (select C.hundred FROM tenk2 C
> WHERE c.odd = b.odd));
>
> It first can be converted to the below format without any issue.
>
> SELECT * FROM tenk1 A SEMI JOIN tenk2 B
> on A.hundred in (select C.hundred FROM tenk2 C
> WHERE c.odd = b.odd);
>
> Then without the restriction, since we only pull the varnos from
> sublink->testexpr, then it is {A}, so it convert to
>
> SELECT * FROM
> (tenk1 A SEMI JOIN LATERAL (SELECT c.hundred FROM tenk2 C)
> ON c.odd = b.odd AND a.hundred = v.hundred)
> SEMI JOIN on tenk2 B ON TRUE;
>
> then the above query is NOT A VALID QUERY since:
> 1. The above query is *not* same as
>
> SELECT * FROM (tenk1 A SEMI JOIN tenk2 B) on true
> SEMI JOIN LATERAL (SELECT c.hundred FROM tenk2 C) v
> ON v.odd = b.odd;
>
> 2. The above query requires b.odd when B is not available. So it is
> right that an optimizer can't generate a plan for it. The fix would
> be to do the restriction before applying this optimization.
>
> I'm not sure pull-up-subquery can play any role here, IIUC, the bad thing
> happens before pull-up-subquery.
>
> I also write & analyze more test and found no issue by me
>
> 1. SELECT * FROM tenk1 A LEFT JOIN tenk2 B
> ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
> ==> should not be pull-up to rarg of the left join since A.hundred is not
> available.
>
> 2.  SELECT * FROM tenk1 A LEFT JOIN tenk2 B
> ON B.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = a.odd);
> ==> should not be pull-up to rarg of the left join since A.odd is not
> available.
>
> 3. SELECT * FROM tenk1 A LEFT JOIN tenk2 B
> ON B.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
> ==> should be pull-up to rarg of left join.
>
> 4. SELECT * FROM tenk1 A INNER JOIN tenk2 B
> ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
> ==> pull-up as expected.
>
> 5. SELECT * FROM tenk1 A RIGHT JOIN tenk2 B
> ON A.hundred in (SELECT c.hundred FROM tenk2 C WHERE c.odd = b.odd);
> ==> should not be pull-up into larg of left join since b.odd is not
> available.
>
>
After reviewing, I want to suggest some changes related to the code and
tests.

First of all, I think, it would be better to "treat" change to
"consider" and rewrite the pull-up check condition in two lines:

/*
* If the sub-select refers to any Vars of the parent query, we so let's
* considering it as LATERAL.  (Vars of higher levels don't matter here.)
*/

use_lateral = !bms_is_empty(sub_ref_outer_relids) &&
bms_is_subset(sub_ref_outer_relids, available_rels);

if (!use_lateral && !bms_is_empty(sub_ref_outer_relids))
return NULL;

Secondly, I noticed another interesting feature in your patch and I
think it could be added to the test.

If we get only one row from the aggregated subquery, we can pull-up it
in the subquery scan filter.

postgres=# explain (costs off)
SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);

QUERY PLAN
--------------------------------------------------------------
Nested Loop Left Join
->  Seq Scan on tenk1 a
->  Materialize
->  Nested Loop
->  Seq Scan on tenk2 b
*->  Subquery Scan on "ANY_subquery"
Filter: (b.hundred = "ANY_subquery".min)*
->  Aggregate
->  Seq Scan on tenk2 c
Filter: (odd = b.odd)
(10 rows)

It was impossible without your patch:

postgres=# explain (costs off)
SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
QUERY PLAN
---------------------------------------------------
Nested Loop Left Join
->  Seq Scan on tenk1 a
->  Materialize
->  Seq Scan on tenk2 b
Filter: (SubPlan 1)
SubPlan 1
->  Aggregate
->  Seq Scan on tenk2 c
Filter: (odd = b.odd)
(9 rows)

And I found an alternative query, when aggregated sublink will pull-up
into JoinExpr condition.

explain (costs off)
SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON B.hundred in (SELECT count(c.hundred) FROM tenk2 C group by (c.odd));
QUERY PLAN
-------------------------------------------------------------
Nested Loop Left Join
->  Seq Scan on tenk1 a
->  Materialize
->  Hash Semi Join
*Hash Cond: (b.hundred = "ANY_subquery".count)*
->  Seq Scan on tenk2 b
->  Hash
->  Subquery Scan on "ANY_subquery"
->  HashAggregate
Group Key: c.odd
->  Seq Scan on tenk2 c
(11 rows)

Unfortunately, I found a request when sublink did not pull-up, as in the
examples above. I couldn't quite figure out why.

create table a (x int, y int, z int, t int);
create table b (x int, t int);
create unique index on a (t, x);
create index on b (t,x);
insert into a select id, id, id, id FROM generate_series(1,100000) As id;
insert into b select id, id FROM generate_series(1,1000) As id;

explain (analyze, costs off, buffers)
select b.x, b.x, a.y
from b
left join a
on b.x=a.x and
*b.t in
(select max(a0.t) *
from a a0
where a0.x = b.x and
a0.t = b.t);

QUERY PLAN
------------------------------------------------------------------------------------------------------------
Hash Right Join (actual time=1.150..58.512 rows=1000 loops=1)
Hash Cond: (a.x = b.x)
*Join Filter: (SubPlan 2)*
Buffers: shared hit=3546
->  Seq Scan on a (actual time=0.023..15.798 rows=100000 loops=1)
Buffers: shared hit=541
->  Hash (actual time=1.038..1.042 rows=1000 loops=1)
Buckets: 4096  Batches: 1  Memory Usage: 72kB
Buffers: shared hit=5
->  Seq Scan on b (actual time=0.047..0.399 rows=1000 loops=1)
Buffers: shared hit=5
SubPlan 2
->  Result (actual time=0.018..0.018 rows=1 loops=1000)
Buffers: shared hit=3000
InitPlan 1 (returns \$2)
->  Limit (actual time=0.015..0.016 rows=1 loops=1000)
Buffers: shared hit=3000
->  Index Only Scan using a_t_x_idx on a a0 (actual
time=0.014..0.014 rows=1 loops=1000)
Index Cond: ((t IS NOT NULL) AND (t = b.t) AND
(x = b.x))
Heap Fetches: 1000
Buffers: shared hit=3000
Planning Time: 0.630 ms
Execution Time: 58.941 ms
(23 rows)

I thought it would be:

explain (analyze, costs off, buffers)
select b.x, b.x, a.y
from b
left join a on
b.x=a.x and
*b.t =
(select max(a0.t) *
from a a0
where a0.x = b.x and
a0.t <= b.t);

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Hash Right Join (actual time=1.181..67.927 rows=1000 loops=1)
Hash Cond: (a.x = b.x)
*Join Filter: (b.t = (SubPlan 2))*
Buffers: shared hit=3546
->  Seq Scan on a (actual time=0.022..17.109 rows=100000 loops=1)
Buffers: shared hit=541
->  Hash (actual time=1.065..1.068 rows=1000 loops=1)
Buckets: 4096  Batches: 1  Memory Usage: 72kB
Buffers: shared hit=5
->  Seq Scan on b (actual time=0.049..0.401 rows=1000 loops=1)
Buffers: shared hit=5
SubPlan 2
->  Result (actual time=0.025..0.025 rows=1 loops=1000)
Buffers: shared hit=3000
InitPlan 1 (returns \$2)
->  Limit (actual time=0.024..0.024 rows=1 loops=1000)
Buffers: shared hit=3000
->  Index Only Scan Backward using a_t_x_idx on a a0
(actual time=0.023..0.023 rows=1 loops=1000)
Index Cond: ((t IS NOT NULL) AND (t <= b.t)
AND (x = b.x))
Heap Fetches: 1000
Buffers: shared hit=3000
Planning Time: 0.689 ms
Execution Time: 68.220 ms
(23 rows)

If you noticed, it became possible after replacing the "in" operator
with "=".

reviewer, if you don't mind.

--
Regards,
Alena Rybakina

Attachment Content-Type Size
pull-up.diff text/x-patch 4.3 KB