Re: Consider parallel for lateral subqueries with limit

From: James Coleman <jtc331(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, brian(at)brianlikespostgres(dot)com
Subject: Re: Consider parallel for lateral subqueries with limit
Date: 2022-09-26 15:17:01
Message-ID: CAAaqYe-8nyU0msNUjSyBNZFWph4p6or4HGv=2H2qbdSKDAyHCQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 26, 2022 at 10:37 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Sep 22, 2022 at 5:19 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
> > > Your sample query gets a plan like this:
> > >
> > > Nested Loop Left Join  (cost=0.00..1700245.00 rows=10000 width=8)
> > >   ->  Seq Scan on foo  (cost=0.00..145.00 rows=10000 width=4)
> > >   ->  Limit  (cost=0.00..170.00 rows=1 width=4)
> > >         ->  Seq Scan on bar  (cost=0.00..170.00 rows=1 width=4)
> > >               Filter: (foo.a = a)
> > >
> > > If this were to occur inside a larger plan tree someplace, it would be
> > > OK to insert a Gather node above the Nested Loop node without doing
> > > anything further, because then the parameter that stores foo.a would
> > > be both set and used in the worker. If you wanted to insert a Gather
> > > at any other place in this plan, things get more complicated. But just
> > > because you have LATERAL doesn't mean that you have this problem,
> > > because if you delete the "limit 1" then the subqueries get flattened
> > > together and the parameter disappears,
> >
> > For future reference in this email thread when you remove the "limit
> > 1" this is the plan you get:
> >
> > Merge Right Join  (cost=372.18..815.71 rows=28815 width=8)
> >   Merge Cond: (bar.a = foo.a)
> >   ->  Sort  (cost=158.51..164.16 rows=2260 width=8)
> >         Sort Key: bar.a
> >         ->  Seq Scan on bar  (cost=0.00..32.60 rows=2260 width=8)
> >   ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
> >         Sort Key: foo.a
> >         ->  Seq Scan on foo  (cost=0.00..35.50 rows=2550 width=4)
> >
> > Just to make sure I'm following: by "doesn't mean that you have this
> > problem" you mean "doesn't mean you have this limitation on parallel
> > query"?
>
> I'm talking specifically about whether there's a parameter. The
> problem here isn't created by LATERAL, but by parameters. In the
> nested loop plan, there's a parameter that's storing foo.a, and the
> storage location that holds that parameter value is in backend-private
> memory, so you can't set the value in the leader and then use it in a
> worker, and that restricts where you can insert a Gather node. But in
> the Merge Join plan (or if you got a Hash Join plan) there is no
> parameter. So the fact that parameter storage isn't shared between
> leaders and workers does not matter.
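
For concreteness, a query of roughly this shape reproduces the two plans above. The original sample query isn't quoted in this message, so the exact SQL here is a reconstruction using the foo/bar tables and the foo.a = a filter visible in the plans:

```sql
-- Hypothetical reconstruction of the kind of query under discussion:
-- the LIMIT 1 prevents the subquery from being flattened, so the
-- planner keeps a correlated subplan and passes foo.a down to the
-- inner Limit node as a parameter.
SELECT *
FROM foo
LEFT JOIN LATERAL (
    SELECT * FROM bar WHERE bar.a = foo.a LIMIT 1
) sub ON true;

-- Dropping the LIMIT lets the subquery be flattened into an ordinary
-- join (the Merge Right Join plan above), and the parameter disappears.
```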

Ah, yes, right.

> > Yes, that's a good point too. I need to play with these examples and
> > confirm whether lateral_relids gets set in that case. IIRC when that's
> > set isn't exactly the same as whether or not the LATERAL keyword is
> > used, and I should clarify that my claims here are meant to be about
> > when we execute it that way regardless of the keyword usage. The
> > keyword usage I'd assumed just made it easier to talk about, but maybe
> > you're implying that it's actually generating confusion.
>
> Yes, I think so.
>
> Stepping back a bit, commit 75f9c4ca5a8047d7a9cfbc7d51a610933d04dc7f
> introduced the code that is at issue here, and it took the position
> that limit/offset should be marked parallel-restricted because they're
> nondeterministic. I'm not sure that's really quite the right thing.
> The original idea behind having things be parallel-restricted was that
> there might be stuff which depends on state in the leader that is not
> replicated to or shared with the workers, and thus it must be executed
> in the leader to get the right answer. Here, however, there is no such
> problem. Something like LIMIT/OFFSET that is nondeterministic is
> perfectly safe to execute in a worker, or in the leader. It's just not
> safe to execute the same computation more than once and assume that we
> will get the same answer each time. But that's really a different
> problem.
>
> And the problem that you've run into here, I think, is that if a Limit
> node is getting fed a parameter from higher up in the query plan, then
> it's not really the same computation being performed every time, and
> thus the non-determinism doesn't matter, and thus the
> parallel-restriction goes away, but the code doesn't know that.

Correct. I think the simplest description of the proper limitation is
that we can't parallelize when either:
1. The param comes from outside the worker, or
2. The "same" param value from the same tuple is computed in multiple
workers (as distinct from the same value from different outer tuples).
Or, to put it positively, we can parallelize when:
1. There are no params, or
2. The param is uniquely generated and used inside a single worker.
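
The two positive conditions above can be modeled as a simple containment check. This is a toy sketch, not actual planner code, and the names are illustrative:

```python
# A toy model (not PostgreSQL source) of the rule stated above:
# a plan subtree can run under a Gather node iff every parameter it
# uses is also generated inside that same subtree, so each worker
# both sets and reads its own private copy of the param.

def parallel_safe(params_used: set, params_generated_inside: set) -> bool:
    """Safe when there are no params at all (condition 1), or when
    every param the subtree consumes is produced within the subtree
    itself (condition 2); an empty params_used set satisfies the
    subset test trivially, so one check covers both conditions."""
    return params_used <= params_generated_inside

# No params at all: safe.
assert parallel_safe(set(), set())
# Param set by the lateral outer scan inside the same subtree: safe.
assert parallel_safe({"$0"}, {"$0"})
# Param comes from outside the subtree (e.g. set in the leader): unsafe.
assert not parallel_safe({"$0"}, set())
```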

I believe the followup email I sent (with an updated patch) details
the correctness of that approach; I'd be interested in your feedback
on that (I recognize it's quite a long email, though...) if you get a
chance.

Thanks for your review on this so far,
James Coleman
