Re: Parallelize correlated subqueries that execute within each worker

From: vignesh C <vignesh21(at)gmail(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Richard Guo <guofenglinux(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Parallelize correlated subqueries that execute within each worker
Date: 2024-01-09 07:08:48
Message-ID: CALDaNm3XwCCf=CXsrCrYSa3h1x9PXr2xV+=gMb+j8XEnpnQ8hw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 4 Jul 2023 at 06:56, James Coleman <jtc331(at)gmail(dot)com> wrote:
>
> On Sun, Jun 11, 2023 at 10:23 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
> >
> > ...
> > > And while trying the v9 patch I came across a crash with the query
> > > below.
> > >
> > > set min_parallel_table_scan_size to 0;
> > > set parallel_setup_cost to 0;
> > > set parallel_tuple_cost to 0;
> > >
> > > explain (costs off)
> > > select * from pg_description t1 where objoid in
> > > (select objoid from pg_description t2 where t2.description = t1.description);
> > > QUERY PLAN
> > > --------------------------------------------------------
> > > Seq Scan on pg_description t1
> > > Filter: (SubPlan 1)
> > > SubPlan 1
> > > -> Gather
> > > Workers Planned: 2
> > > -> Parallel Seq Scan on pg_description t2
> > > Filter: (description = t1.description)
> > > (7 rows)
> > >
> > > select * from pg_description t1 where objoid in
> > > (select objoid from pg_description t2 where t2.description = t1.description);
> > > WARNING: terminating connection because of crash of another server process
> > >
> > > Seems something is wrong when extracting the argument from the Param in
> > > parallel worker.
> >
> > With what I'm trying to change I don't think this plan should ever be
> > generated since it means we'd have to pass a param from the outer seq
> > scan into the parallel subplan, which we can't do (currently).
> >
> > I've attached the full backtrace to the email, but as you hinted at
> > the parallel worker is trying to get the param (in this case
> > detoasting it), but the param doesn't exist on the worker, so it seg
> > faults. Looking at this further I think there's an existing test case
> > that exposes the misplanning here (the one right under the comment
> > "Parallel Append is not to be used when the subpath depends on the
> > outer param" in select_parallel.sql), but it doesn't seg fault because
> > the param is an integer, doesn't need to be detoasted, and therefore
> > (I think) we skate by (but probably with wrong results in depending on
> > the dataset).
> >
> > Interestingly this is one of the existing test queries my original
> > patch approach didn't change, so this gives me something specific to
> > work with improving the path. Thanks for testing this and bringing
> > this to my attention!
>
> Here's what I've found debugging this:
>
> There's only a single gather path ever created when planning this
> query, making it easy to know which one is the problem. That gather
> path is created with this stacktrace:
>
> frame #0: 0x0000000105291590
> postgres`create_gather_path(root=0x000000013081ae78,
> rel=0x000000013080c8e8, subpath=0x000000013081c080,
> target=0x000000013081c8c0, required_outer=0x0000000000000000,
> rows=0x0000000000000000) at pathnode.c:1971:2
> frame #1: 0x0000000105208e54
> postgres`generate_gather_paths(root=0x000000013081ae78,
> rel=0x000000013080c8e8, override_rows=false) at allpaths.c:3097:4
> frame #2: 0x00000001052090ec
> postgres`generate_useful_gather_paths(root=0x000000013081ae78,
> rel=0x000000013080c8e8, override_rows=false) at allpaths.c:3241:2
> frame #3: 0x0000000105258754
> postgres`apply_scanjoin_target_to_paths(root=0x000000013081ae78,
> rel=0x000000013080c8e8, scanjoin_targets=0x000000013081c978,
> scanjoin_targets_contain_srfs=0x0000000000000000,
> scanjoin_target_parallel_safe=true, tlist_same_exprs=true) at
> planner.c:7696:3
> frame #4: 0x00000001052533cc
> postgres`grouping_planner(root=0x000000013081ae78, tuple_fraction=0.5)
> at planner.c:1611:3
> frame #5: 0x0000000105251e9c
> postgres`subquery_planner(glob=0x00000001308188d8,
> parse=0x000000013080caf8, parent_root=0x000000013080cc38,
> hasRecursion=false, tuple_fraction=0.5) at planner.c:1062:2
> frame #6: 0x000000010526b134
> postgres`make_subplan(root=0x000000013080cc38,
> orig_subquery=0x000000013080ff58, subLinkType=ANY_SUBLINK,
> subLinkId=0, testexpr=0x000000013080d848, isTopQual=true) at
> subselect.c:221:12
> frame #7: 0x0000000105268b8c
> postgres`process_sublinks_mutator(node=0x000000013080d6d8,
> context=0x000000016b0998f8) at subselect.c:1950:10
> frame #8: 0x0000000105268ad8
> postgres`SS_process_sublinks(root=0x000000013080cc38,
> expr=0x000000013080d6d8, isQual=true) at subselect.c:1923:9
> frame #9: 0x00000001052527b8
> postgres`preprocess_expression(root=0x000000013080cc38,
> expr=0x000000013080d6d8, kind=0) at planner.c:1169:10
> frame #10: 0x0000000105252954
> postgres`preprocess_qual_conditions(root=0x000000013080cc38,
> jtnode=0x000000013080d108) at planner.c:1214:14
> frame #11: 0x0000000105251580
> postgres`subquery_planner(glob=0x00000001308188d8,
> parse=0x0000000137010d68, parent_root=0x0000000000000000,
> hasRecursion=false, tuple_fraction=0) at planner.c:832:2
> frame #12: 0x000000010525042c
> postgres`standard_planner(parse=0x0000000137010d68,
> query_string="explain (costs off)\nselect * from pg_description t1
> where objoid in\n (select objoid from pg_description t2 where
> t2.description = t1.description);", cursorOptions=2048,
> boundParams=0x0000000000000000) at planner.c:411:9
>
> There aren't any lateral markings on the rels. Additionally the
> partial path has param_info=null (I found out from Tom in a separate
> thread [1] that this is only set for outer relations from the same
> query level).
>
> The only param that I could easily find at first was a single param of
> type PARAM_EXTERN in root->plan_params in make_subplan().
>
> I spent a lot of time trying to figure out where we could find the
> PARAM_EXEC param that's being fed into the subplan, but it doesn't
> seem like we have access to any of these things at the point in the
> path creation process that it's interesting to us when inserting the
> gather nodes.
>
> Given all of that I settled on this approach:
> 1. Modify is_parallel_safe() to by default ignore PARAM_EXEC params.
> 2. Add is_parallel_safe_with_params() that checks for the existence of
> such params.
> 3. Store the required params in a bitmapset on each base rel.
> 4. Union the bitmapset on join rels.
> 5. Only insert a gather node if that bitmapset is empty.
>
> I have an intuition that there's some spot (e.g. joins) that we should
> be removing params from this set (e.g., when we've satisfied them),
> but I haven't been able to come up with such a scenario as yet.
>
> The attached v11 fixes the issue you reported.

One of the tests has failed in CFBot at [1] with:
+++ /tmp/cirrus-ci-build/build/testrun/pg_upgrade/002_pg_upgrade/data/results/select_parallel.out
2023-12-20 20:08:42.480004000 +0000
@@ -137,23 +137,24 @@
explain (costs off)
select (select max((select pa1.b from part_pa_test pa1 where pa1.a = pa2.a)))
from part_pa_test pa2;
- QUERY PLAN
---------------------------------------------------------------
- Aggregate
+ QUERY PLAN
+--------------------------------------------------------------------
+ Finalize Aggregate
-> Gather
Workers Planned: 3
- -> Parallel Append
- -> Parallel Seq Scan on part_pa_test_p1 pa2_1
- -> Parallel Seq Scan on part_pa_test_p2 pa2_2
+ -> Partial Aggregate
+ -> Parallel Append
+ -> Parallel Seq Scan on part_pa_test_p1 pa2_1
+ -> Parallel Seq Scan on part_pa_test_p2 pa2_2
+ SubPlan 1
+ -> Append
+ -> Seq Scan on part_pa_test_p1 pa1_1
+ Filter: (a = pa2.a)
+ -> Seq Scan on part_pa_test_p2 pa1_2
+ Filter: (a = pa2.a)
SubPlan 2
-> Result
- SubPlan 1
- -> Append
- -> Seq Scan on part_pa_test_p1 pa1_1
- Filter: (a = pa2.a)
- -> Seq Scan on part_pa_test_p2 pa1_2
- Filter: (a = pa2.a)
-(14 rows)
+(15 rows)

More details of the failure is available at [2].

[1] - https://cirrus-ci.com/task/5685696451575808
[2] - https://api.cirrus-ci.com/v1/artifact/task/5685696451575808/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/regress_log_002_pg_upgrade

Regards,
Vignesh

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2024-01-09 07:26:29 Re: INFORMATION_SCHEMA note
Previous Message Hayato Kuroda (Fujitsu) 2024-01-09 07:01:41 RE: speed up a logical replica setup