Re: Parallelize correlated subqueries that execute within each worker

From: Antonin Houska <ah(at)cybertec(dot)at>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Parallelize correlated subqueries that execute within each worker
Date: 2023-02-06 16:40:55
Message-ID: 20283.1675701655@antos
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

James Coleman <jtc331(at)gmail(dot)com> wrote:

> On Sat, Jan 21, 2023 at 10:07 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
> > ...
> > While working through Tomas's comment about a conditional in the
> > max_parallel_hazard_waker being guaranteed true I realized that in the
> > current version of the patch the safe_param_ids tracking in
> > is_parallel_safe isn't actually needed any longer. That seemed
> > suspicious, and so I started digging, and I found out that in the
> > current approach all of the tests pass with only the changes in
> > clauses.c. I don't believe that the other changes aren't needed;
> > rather I believe there isn't yet a test case exercising them, but I
> > realize that means I can't prove they're needed. I spent some time
> > poking at this, but at least with my current level of imagination I
> > haven't stumbled across a query that would exercise these checks.
>
> I played with this a good bit more yesterday, I'm now a good bit more
> confident this is correct. I've cleaned up the patch; see attached for
> v7.
>
> Here's some of my thought process:
> The comments in src/include/nodes/pathnodes.h:2953 tell us that
> PARAM_EXEC params are used to pass values around from one plan node to
> another in the following ways:
> 1. Values down into subqueries (for outer references in subqueries)
> 2. Up out of subqueries (for the results of a subplan)
> 3. From a NestLoop plan node into its inner relation (when the inner
> scan is parameterized with values from the outer relation)
>
> Case (2) is already known to be safe (we currently add these params to
> safe_param_ids in max_parallel_hazard_walker when we encounter a
> SubPlan node).
>
> I also believe case (3) is already handled. We don't build partial
> paths for joins when joinrel->lateral_relids is non-empty, and join
> order calculations already require that parameterization here go the
> correct way (i.e., inner depends on outer rather than the other way
> around).
>
> That leaves us with only case (1) to consider in this patch. Another
> way of saying this is that this is really the only thing the
> safe_param_ids tracking is guarding against. For params passed down
> into subqueries we can further distinguish between init plans and
> "regular" subplans. We already know that params from init plans are
> safe (at the right level). So we're concerned here with a way to know
> if the params passed to subplans are safe. We already track required
> rels in ParamPathInfo, so it's fairly simple to do this test.
>
> Which this patch we do in fact now see (as expected) rels with
> non-empty lateral_relids showing up in generate_[useful_]gather_paths.
> And the partial paths can now have non-empty required outer rels. I'm
> not able to come up with a plan that would actually be caught by those
> checks; I theorize that because of the few places we actually call
> generate_[useful_]gather_paths we are in practice already excluding
> those, but for now I've left these as a conditional rather than an
> assertion because it seems like the kind of guard we'd want to ensure
> those methods are safe.

Maybe we can later (in separate patches) relax the restrictions imposed on
partial path creation a little bit, so that more parameterized partial paths
are created.

One particular case that should be rejected by your checks is a partial index
path, which can be parameterized, but I couldn't construct a query that makes
your checks fire. Maybe the reason is that a parameterized index path is
mostly used on the inner side of a NL join, however no partial path can be
used there. (The problem is that each worker evaluating the NL join would only
see a subset of the inner relation, which whould lead to incorrect results.)

So I'd also choose conditions rather than assert statements.

Following are my (minor) findings:

In generate_gather_paths() you added this test

/*
* Delay gather path creation until the level in the join tree where all
* params used in a worker are generated within that worker.
*/
if (!bms_is_subset(required_outer, rel->relids))
return;

but I'm not sure if required_outer can contain anything of rel->relids. How
about using bms_is_empty(required) outer, or even this?

if (required_outer)
return;

Similarly,

/* We can't pass params to workers. */
if (!bms_is_subset(PATH_REQ_OUTER(cheapest_partial_path), rel->relids))

might look like

if (!bms_is_empty(PATH_REQ_OUTER(cheapest_partial_path)))

or

if (PATH_REQ_OUTER(cheapest_partial_path))

In particular, build_index_paths() does the following when setting
outer_relids (which eventually becomes (path->param_info->ppi_req_outer):

/* Enforce convention that outer_relids is exactly NULL if empty */
if (bms_is_empty(outer_relids))
outer_relids = NULL;

Another question is whether in this call

simple_gather_path = (Path *)
create_gather_path(root, rel, cheapest_partial_path, rel->reltarget,
required_outer, rowsp);

required_outer should be passed to create_gather_path(). Shouldn't it rather
be PATH_REQ_OUTER(cheapest_partial_path) that you test just above? Again,
build_index_paths() initializes outer_relids this way

outer_relids = bms_copy(rel->lateral_relids);

but then it may add some more relations:

/* OK to include this clause */
index_clauses = lappend(index_clauses, iclause);
outer_relids = bms_add_members(outer_relids,
rinfo->clause_relids);

So I think that PATH_REQ_OUTER(cheapest_partial_path) in
generate_gather_paths() can eventually contain more relations than
required_outer, and therefore it's safer to check the first.

Similar comments might apply to generate_useful_gather_paths(). Here I also
suggest to move this test

/* We can't pass params to workers. */
if (!bms_is_subset(PATH_REQ_OUTER(subpath), rel->relids))
continue;

to the top of the loop because it's relatively cheap.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2023-02-06 16:42:57 Re: undersized unions
Previous Message Tom Lane 2023-02-06 16:23:04 Re: [PATCH] Add pretty-printed XML output option