Re: Parallelize correlated subqueries that execute within each worker

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: James Coleman <jtc331(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: Re: Parallelize correlated subqueries that execute within each worker
Date: 2021-11-03 16:27:42
Message-ID: CA+TgmobpQPdJC+LPgdxNWUV-0hOR=B-rOJQC_8ihB_JG3fU7mA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Nov 3, 2021 at 11:14 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> FWIW, I've never been very happy with those fields either. IIRC the
> design in that area was all Vadim's, but to the extent that there's
> any usable documentation of extParam/allParam, it was filled in by me
> while trying to understand what Vadim did. If somebody wants to step
> up and do a rewrite to make the planner's Param management more useful
> or at least easier to understand, I think that'd be great.

Good to know, thanks.

> But anyway: yeah, those fields as currently constituted don't help
> much. They tell you which Params are consumed by this node or its
> subnodes, but not where those Params came from. The planner's
> plan_params and outer_params fields might be more nearly the right
> thing, but I'm not sure they're spot-on either, nor that they're
> up-to-date at the point where you'd want to make decisions about
> Gather safety.

One thing I discovered when I was looking at this a few years ago is
that there was only one query in the regression tests where extParam
and allParam were not the same. The offending query was select 1 =
all(select (select 1)), and the resulting plan has a Materialize node
with an attached InitPlan. For that Materialize node, extParam = {}
and allParam = {$0}, with $0 also being the output parameter of the
InitPlan attached that that Materialize node. In every other node in
that plan and in every node of every other plan generated by the
regression tests, the values were identical. So it's extremely niche
that these fields are even different from each other, and it's unclear
to me that we really need both of them.

What's also interesting is that extParam is computed (by
finalize_plan) as plan->extParam = bms_del_members(plan->extParam,
initSetParam). So I think it mostly ends up that extParam for a node
is not exactly all the parameters that anything under that node cares
about, but rather - approximately - all the things that anything under
that node cares about that aren't also set someplace under that node.
If it were exactly that, I think it would be perfect for our needs
here: if the set of things used but not set below the current level is
empty, it's OK to insert a Gather node; otherwise, it's not, at least,
not unless we find a way to pipe parameters from the leader into the
workers. But I think there's some reason that I no longer remember why
it's not exactly that, and therefore the idea doesn't work.

One problem I do remember is that attaching initplans at the top of
each subquery level as we presently do is really not good for this
kind of thing. Suppose you have several levels of Nested Loop and
someplace down in the plan you reference an InitPlan. The planner sees
no harm in attaching the InitPlan at the top level, which makes it
unsafe to put the Gather any place but at the top level. If you
attached the InitPlan to the lowest node in the plan tree that is high
enough to be above all the places that use the value from that
parameter, you could potentially shift the Gather down the plan tree,
which would be great if, for example, there's exactly one
parallel-restricted join and the rest are parallel-safe. The best plan
might be to do all the other joins under a Gather and then perform the
parallel-restricted join above it.

But I found it very hard to figure out how to rejigger the logic that
places InitPlans to be more intelligent, and eventually gave up.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-11-03 16:52:16 Re: Parallelize correlated subqueries that execute within each worker
Previous Message Ivan Panchenko 2021-11-03 16:15:18 Re[2]: On login trigger: take three