Re: Consider parallel for lateral subqueries with limit

From: James Coleman <jtc331(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, brian(at)brianlikespostgres(dot)com
Subject: Re: Consider parallel for lateral subqueries with limit
Date: 2022-09-22 21:19:42
Message-ID: CAAaqYe8yfUkgYsG418a=_PjVdBnX4dxv+qq5bWcYy+KiampuPw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 19, 2022 at 4:29 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Mon, Sep 19, 2022 at 3:58 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
> > But in the case where there's correlation via LATERAL we already don't
> > guarantee unique executions for a given set of params into the lateral
> > subquery execution, right? For example, suppose we have:
> >
> > select *
> > from foo
> > left join lateral (
> > select n
> > from bar
> > where bar.a = foo.a
> > limit 1
> > ) on true
> >
> > and suppose that foo.a (in the table scan) returns these values in
> > order: 1, 2, 1. In that case we'll execute the lateral subquery 3
> > separate times rather than attempting to order the results of foo.a
> > such that we can re-execute the subquery only when the param changes
> > to a new unique value (and we definitely don't cache the results to
> > guarantee unique subquery executions).
>
> I think this is true, but I don't really understand why we should
> focus on LATERAL here. What we really need, and I feel like we've
> talked about this before, is a way to reason about where parameters
> are set and used.

Yes, over in the thread "Parallelize correlated subqueries that
execute within each worker" [1] which was meant to build on top of
this work (though is technically separate). Your bringing it up here
too is making me wonder if we can combine the two and instead of
always allowing subqueries with LIMIT instead (like the other patch
does) delay final determination of parallel safety of rels and paths
(perhaps, as that thread is discussing, until gather/gather merge path
creation).

Side note: I'd kinda like a way (maybe a development GUC or even a
compile time option) to have EXPLAIN show where params are set. If
something like that exists, egg on my face I guess, but I'd love to
know about it.

> Your sample query gets a plan like this:
>
> Nested Loop Left Join (cost=0.00..1700245.00 rows=10000 width=8)
> -> Seq Scan on foo (cost=0.00..145.00 rows=10000 width=4)
> -> Limit (cost=0.00..170.00 rows=1 width=4)
> -> Seq Scan on bar (cost=0.00..170.00 rows=1 width=4)
> Filter: (foo.a = a)
>
> If this were to occur inside a larger plan tree someplace, it would be
> OK to insert a Gather node above the Nested Loop node without doing
> anything further, because then the parameter that stores foo.a would
> be both set and used in the worker. If you wanted to insert a Gather
> at any other place in this plan, things get more complicated. But just
> because you have LATERAL doesn't mean that you have this problem,
> because if you delete the "limit 1" then the subqueries get flattened
> together and the parameter disappears,

For future reference in this email thread when you remove the "limit
1" this is the plan you get:

Merge Right Join (cost=372.18..815.71 rows=28815 width=8)
Merge Cond: (bar.a = foo.a)
-> Sort (cost=158.51..164.16 rows=2260 width=8)
Sort Key: bar.a
-> Seq Scan on bar (cost=0.00..32.60 rows=2260 width=8)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: foo.a
-> Seq Scan on foo (cost=0.00..35.50 rows=2550 width=4)

Just to make sure I'm following: by "doesn't mean that you have this
problem" you mean "doesn't mean you have this limitation on parallel
query"?

The "flattening out" (conversion to join between two table scans
executed a single time each) is an interesting case because I consider
that to be "just" a performance optimization, and therefore I don't
think anything about the guarantees a user expects should change. But
interestingly: it does end up providing stronger guarantees about the
query results than it would if the conversion didn't happen (the
subquery results in only a single table scan whereas without the
change a scan per outer row means it's *possible* to get different
results across different outer rows even with the same join key
value).

> and if you delete the lateral
> reference (i.e. WHERE foo.a = bar.a) then there's still a subquery but
> it no longer refers to an outer parameter. And on the flip side just
> because you don't have LATERAL doesn't mean that you don't have this
> problem. e.g. the query could instead be:
>
> select *, (select n from bar where bar.a = foo.a limit 1) from foo;
>
> ...which I think is pretty much equivalent to your formulation and has
> the same problem as far as parallel query as your formulation but does
> not involve the LATERAL keyword.

Yes, that's a good point too. I need to play with these examples and
confirm whether lateral_relids gets set in that case. IIRC when that's
set isn't exactly the same as whether or not the LATERAL keyword is
used, and I should clarify that my claims here are meant to be about
when we execute it that way regardless of the keyword usage. The
keyword usage I'd assumed just made it easier to talk about, but maybe
you're implying that it's actually generating confusion.

James Coleman

1: https://www.postgresql.org/message-id/CA%2BTgmoYXm2NCLt1nikWfYj1_r3%3DfsoNCHCtDVdN7X1uX_xuXgw%40mail.gmail.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-09-22 21:21:27 Re: Extending outfuncs support to utility statements
Previous Message Matthias van de Meent 2022-09-22 21:12:32 Re: Adding CommandID to heap xlog records