Re: Parallelize correlated subqueries that execute within each worker

From: James Coleman <jtc331(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Parallelize correlated subqueries that execute within each worker
Date: 2022-01-14 19:24:50
Message-ID: CAAaqYe8HLwTGJC2jPZ8noiv2SRSSvS3zSjQQMmeUqNiY4Uw_dw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 15, 2021 at 10:01 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Wed, Nov 3, 2021 at 1:34 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
> > As I understand the current code, parallel plans are largely chosen
> > based not on where it's safe to insert a Gather node but rather by
> > determining if a given path is parallel safe. Through that lens params
> > are a bit of an odd man out -- they aren't inherently unsafe in the
> > way a parallel-unsafe function is, but they can only be used in
> > parallel plans under certain conditions (whether because of project
> > policy, performance, or missing infrastructure).
>
> Right.
>
> > Introducing consider_parallel_rechecking_params and
> > parallel_safe_ignoring_params allows us to keep more context on params
> > and make a more nuanced decision at the proper level of the plan. This
> > is what I mean by "rechecked in the using context", though I realize
> > now that both "recheck" and "context" are overloaded terms in the
> > project, so don't describe the concept particularly clearly. When a
> > path relies on params we can only make a final determination about its
> > parallel safety if we know whether or not the current parallel node
> > can provide the param's value. We don't necessarily know that
> > information until we attempt to generate a full parallel node in the
> > plan (I think what you're describing as "inserting a Gather node")
> > since the param may come from another node in the plan. These new
> > values allow us to do that by tracking tentatively parallel-safe
> > subplans (given proper Gather node placement) and delaying the
> > parallel-safety determination until the point at which a param is
> > available (or not).
>
> So I think I agree with you here. But I don't like all of this
> "ignoring_params" stuff and I don't see why it's necessary. Say we
> don't have both parallel_safe and parallel_safe_ignoring_params. Say
> we just have parallel_safe. If the plan will be parallel safe if the
> params are available, we label it parallel safe. If the plan will not
> be parallel safe even if the params are available, we say it's not
> parallel safe. Then, when we get to generate_gather_paths(), we don't
> generate any paths if there are required parameters that are not
> available. What's wrong with that approach?
>
> Maybe it's clearer to say this: I feel like one extra Boolean is
> either too much or too little. I think maybe it's not even needed. But
> if it is needed, then why just a bool instead of, say, a Bitmapset of
> params that are needed, or something?
>
> I'm sort of speaking from intuition here rather than sure knowledge. I
> might be totally wrong.

Apologies for quite the delay responding to this.

I've been chewing on this a bit, and I was about to go re-read the
code and see how easy it'd be to do exactly what you're suggesting in
generate_gather_paths() (and verifying it doesn't need to happen in
other places). However there's one (I think large) gotcha with that
approach (assuming it otherwise makes sense): it means we do
unnecessary work. In the current patch series we only need to recheck
parallel safety if we're in a situation where we might actually
benefit from doing that work (namely when we have a correlated
subquery we might otherwise be able to execute in a parallel plan). If
we don't track that status we'd have to recheck the full parallel
safety of the path for all paths -- even without correlated
subqueries.

Alternatively we could merge these fields into a single enum field
that tracked these states. Even better, we could use a bitmap to
signify what items are/aren't parallel safe. I'm not sure if that'd
create even larger churn in the patch, but maybe it's worth it either
way. In theory it'd open up further expansions to this concept later
(though I'm not aware of any such ideas).

If you think such an approach would be an improvement I'd be happy to
take a pass at a revised patch.

Thoughts?

Thanks,
James Coleman

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Borisov 2022-01-14 19:38:46 Re: Add 64-bit XIDs into PostgreSQL 15
Previous Message Bossart, Nathan 2022-01-14 19:16:01 Re: O(n) tasks cause lengthy startups and checkpoints