Re: Consider parallel for lateral subqueries with limit

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Greg Nancarrow <gregn4422(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, brian(at)brianlikespostgres(dot)com
Subject: Re: Consider parallel for lateral subqueries with limit
Date: 2022-03-01 22:35:22
Message-ID: 2888169.1646174122@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

James Coleman <jtc331(at)gmail(dot)com> writes:
> On Tue, Jan 4, 2022 at 9:59 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
>> On Tue, Jan 4, 2022 at 5:31 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> I don't really see why this patch is even a little bit safe.

> Suppose lateral is not in play. Then if we have a plan like:

> Gather
> NestLoop
> Scan X
> Limit
> Scan Y

> Because we have the result "X join Limit(Y)" we need "Limit(Y)" to be
> consistent across all of the possible executions of "Limit(Y)" (i.e.,
> in each worker it executes in). That means (absent infrastructure for
> guaranteeing a unique ordering) we obviously can't parallelize the
> inner side of the join as the limit may be applied in different ways
> in each worker's execution.

> Now suppose lateral is in play. Then (given the same plan) instead of
> our result being "X join Limit(Y)" the result is "X join Limit(Y sub
> X)", that is, each row in X is joined to a unique invocation of
> "Limit(Y)".

This argument seems to be assuming that Y is laterally dependent on X,
but the patch as written will take *any* lateral dependency as a
get-out-of-jail-free card. If what we have is "Limit(Y sub Z)" where
Z is somewhere else in the query tree, it's not apparent to me that
your argument holds.

But more generally, I don't think you've addressed the fundamental
concern, which is that a query involving Limit is potentially
nondeterministic (if it lacks a fully-deterministic ORDER BY),
so that different workers could get different answers from it if
they're using a plan type that permits that to happen. (See the
original discussion that led to 75f9c4ca5, at [1].) I do not see
how a lateral dependency removes that hazard. The bug report that
started the original discussion hit the problem because it
generated a plan like

Gather
-> Hash Semi Join
-> Parallel Seq Scan
-> Hash
-> Limit
-> Seq Scan

We didn't make the submitter drill down far enough to verify
exactly why he got nondeterministic results from the Limit, but
I suppose the reason was that the table was big enough to trigger
"synchronize_seqscans" behavior, allowing different workers to
read different parts of that table. Now, that particular case
didn't have any lateral dependency, but if there was one it'd
just have resulted in changing the hash join to a nestloop join,
and the nondeterminism hazard would be exactly the same AFAICS.

> In this case we are already conceivably getting different
> results for each execution of the subquery "Limit(Y)" even if we're
> not running those executions across multiple workers.

That seems to be about the same argument Andres made initially
in the old thread, but we soon shot that down as not being the
level of guarantee we want to provide. There's nothing in the
SQL standard that says that
select * from events where account in
(select account from events
where data->>'page' = 'success.html' limit 3);
(the original problem query) shall execute the sub-query
only once, but people expect it to act that way.

If you want to improve this area, my feeling is that it'd be
better to look into what was speculated about in the old
thread: LIMIT doesn't create nondeterminism if the query has
an ORDER BY that imposes a unique row ordering, ie
order-by-primary-key. We didn't have planner infrastructure
that would allow checking that cheaply in 2018, but maybe
there is some now?

regards, tom lane

[1] https://www.postgresql.org/message-id/flat/153417684333.10284.11356259990921828616%40wrigleys.postgresql.org

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-03-01 22:53:34 Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
Previous Message Jacob Champion 2022-03-01 22:05:37 Re: [PATCH] Expose port->authn_id to extensions and triggers