Re: Parallelize correlated subqueries that execute within each worker

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: Re: Parallelize correlated subqueries that execute within each worker
Date: 2021-11-03 14:48:48
Message-ID: CA+TgmobtK2OnhNaKbq8Q+k217mCUdpLFeMXTjxr6QZMG3KE5Gw@mail.gmail.com
Lists: pgsql-hackers

As a preliminary comment, it would be quite useful to get Tom Lane's
opinion on this, since it's not an area I understand especially well,
and I think he understands it better than anyone.

On Fri, May 7, 2021 at 12:30 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
> The basic idea is that we need to track (both on nodes and relations)
> not only whether that node or rel is parallel safe but also whether
> it's parallel safe assuming params are rechecked in the using context.
> That allows us to delay making a final decision until we have
> sufficient context to conclude that a given usage of a Param is
> actually parallel safe or unsafe.

I don't really understand what you mean by "assuming params are
rechecked in the using context." However, I think that a possibly
better approach to this whole area would be to try to solve the
problem by putting limits on where you can insert a Gather node.
Consider:

Nested Loop
  -> Seq Scan on x
  -> Index Scan on y
       Index Cond: y.q = x.q

If you insert a Gather node atop the Index Scan, necessarily changing
it to a Parallel Index Scan, then you need to pass values around. For
every value we get for x.q, we would need to start workers and send
them the value of x.q; they would do a parallel index scan, working
together to find all rows where y.q = x.q, and then exit. We would
repeat this for every tuple from x. In the absence of infrastructure to
pass those parameters, we can't put the Gather there. We also don't
want to, because it would be really slow.

If you insert the Gather node atop the Seq Scan or the Nested Loop, in
either case necessarily changing the Seq Scan to a Parallel Seq Scan,
you have no problem. If you put it on top of the Nested Loop, the
parameter will be set in the workers and used in the workers and
everything is fine. If you put it on top of the Seq Scan, the
parameter will be set in the leader -- by the Nested Loop -- and used
in the leader, and again you have no problem.

So in my view of the world, the parameter just acts as an additional
constraint on where Gather nodes can be placed. I don't see that there
are any parameters that are unsafe categorically -- they're just
unsafe if the place where they are set is on a different side of the
Gather from the place where they are used. So I don't understand --
possibly just because I'm dumb -- the idea behind
consider_parallel_rechecking_params, because that seems to be making a
sort of overall judgement about the safety or unsafety of the
parameter on its own merits, rather than thinking about the Gather
placement.
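
To make the placement constraint concrete, here is a toy sketch in
Python. It is not PostgreSQL code: PlanNode, its sets/uses fields, and
gather_ok are hypothetical names invented for illustration, and the
model assumes a node's "sets" are visible only to its children. A Gather
atop a node is treated as acceptable only when every parameter used
under that node is also set under it.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    # Hypothetical toy node, not PostgreSQL's Plan struct.
    name: str
    sets: frozenset = frozenset()   # params this node sets for its children
    uses: frozenset = frozenset()   # params this node reads
    children: list = field(default_factory=list)


def gather_ok(node, in_scope=frozenset()):
    """True if no param used under `node` is set outside of `node`'s subtree.

    `in_scope` holds params set by ancestors that are themselves below the
    candidate Gather; it starts empty because anything set above the Gather
    lives on the other side of it.
    """
    if not node.uses <= in_scope:
        return False
    child_scope = in_scope | node.sets
    return all(gather_ok(child, child_scope) for child in node.children)


# The plan from the example above (assumed modelling: the Nested Loop sets
# x.q for its inner side, and the Index Scan uses it).
seq_scan = PlanNode("Seq Scan on x")
index_scan = PlanNode("Index Scan on y", uses=frozenset({"x.q"}))
nested_loop = PlanNode("Nested Loop", sets=frozenset({"x.q"}),
                       children=[seq_scan, index_scan])

placements = {n.name: gather_ok(n) for n in (nested_loop, seq_scan, index_scan)}
```

In this model the Nested Loop and the Seq Scan both pass the check, and
the Index Scan fails it, which matches the three cases described above.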

When I last worked on this, I had hoped that extParam or allParam
would be the thing that would answer the question: are there any
parameters used under this node that are not also set under this node?
But I seem to recall that neither seemed to be answering precisely
that question, and the lousy naming of those fields and limited
documentation of their intended purpose did not help.
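
For illustration only, the set I was hoping to get could be computed
bottom-up along these lines. This is a toy Python sketch over a
hypothetical Node type; it makes no claim about what extParam or
allParam actually contain.

```python
from typing import NamedTuple


class Node(NamedTuple):
    # Hypothetical toy plan node, not a PostgreSQL structure.
    name: str
    sets: frozenset = frozenset()   # params this node sets for its children
    uses: frozenset = frozenset()   # params this node reads
    children: tuple = ()


def unresolved_params(node):
    """Params used under `node` (inclusive) that are not set under it."""
    below = frozenset()
    for child in node.children:
        below |= unresolved_params(child)
    # Params this node sets are resolved for its children, not for itself.
    return (below - node.sets) | node.uses


seq_scan = Node("Seq Scan on x")
index_scan = Node("Index Scan on y", uses=frozenset({"x.q"}))
nested_loop = Node("Nested Loop", sets=frozenset({"x.q"}),
                   children=(seq_scan, index_scan))
```

An empty result for a node would mean a Gather could go on top of it
without any parameter crossing the Gather.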

--
Robert Haas
EDB: http://www.enterprisedb.com
