Re: Consider parallel for lateral subqueries with limit

From: James Coleman <jtc331(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: brian(at)brianlikespostgres(dot)com
Subject: Re: Consider parallel for lateral subqueries with limit
Date: 2020-12-07 23:45:50
Message-ID: CAAaqYe_zVOj5epc=WhdaSAStNxEuJQHyi2=eF4GXm7rH6ERtFw@mail.gmail.com
Lists: pgsql-hackers

Brian's email didn't keep the relevant headers and so didn't show up
as a reply; I've pasted it here and am replying in the original
thread. You can find his original message at [1].

On Sun, Dec 6, 2020 at 7:34 PM Brian Davis <brian(at)brianlikespostgres(dot)com> wrote:
>
> > Note that near the end of grouping planner we have a similar check:
> >
> > if (final_rel->consider_parallel && root->query_level > 1 &&
> >     !limit_needed(parse))
> >
> > guarding copying the partial paths from the current rel to the final
> > rel. I haven't managed to come up with a test case that exposes that
>
> Played around with this a bit, here's a non-correlated subquery that gets us to that if statement
>
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (bar int);
>
> INSERT INTO foo (bar)
> SELECT
>     g
> FROM
>     generate_series(1, 10000) AS g;
>
>
> SELECT
>     (
>         SELECT
>             bar
>         FROM
>             foo
>         LIMIT 1
>     ) AS y
> FROM
>     foo;

Thanks. That triggers the parallel if case for limit -- did any
additional GUCs need to be modified to get that result? I assume a
parallel plan isn't chosen regardless (even if you remove the
limit_needed check)?

> I also was thinking about the LATERAL part.
>
> I couldn't think of any reason why the uncorrelated subquery's results would need to be shared and therefore the same, when we'll be "looping" over each row of the source table, running the subquery anew for each, conceptually.
>
> But then I tried this...
>
> test=# CREATE TABLE foo (bar int);
> CREATE TABLE
> test=#
> test=# INSERT INTO foo (bar)
> test-# SELECT
> test-# g
> test-# FROM
> test-# generate_series(1, 10) AS g;
> INSERT 0 10
> test=#
> test=#
> test=# SELECT
> test-# foo.bar,
> test-# lat.bar
> test-# FROM
> test-# foo JOIN LATERAL (
> test(# SELECT
> test(# bar
> test(# FROM
> test(# foo AS foo2
> test(# ORDER BY
> test(# random()
> test(# LIMIT 1
> test(# ) AS lat ON true;
> bar | bar
> -----+-----
> 1 | 7
> 2 | 7
> 3 | 7
> 4 | 7
> 5 | 7
> 6 | 7
> 7 | 7
> 8 | 7
> 9 | 7
> 10 | 7
> (10 rows)
>
>
> As you can see, random() is only called once. If postgres were supposed to be running the subquery for each source row, conceptually, it would be a mistake to cache the results of a volatile function like random().

This was genuinely surprising to me. I think one could argue that this
is just an optimization (after all -- if there is no correlation, then
running it once is conceptually/safely the same as running it multiple
times), but that doesn't seem to hold water with the volatile function
in play.

Of course, given the volatile function we'd never parallelize this
anyway. But we still have to consider the case where the result is
otherwise ordered differently between workers (just by virtue of disk
order, for example).
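
For what it's worth -- and this catalog query is mine, just an
illustration, not something from the original exchange -- random()'s
markings in pg_proc show why: it's volatile, and if I recall correctly
also parallel restricted because it depends on backend-local seed
state, so a path evaluating it can't be pushed into a worker anyway:

    SELECT proname, provolatile, proparallel
    FROM pg_proc
    WHERE proname = 'random';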

I've tried the above query using tenk1 from the regress tests to get a
bit more data, and, by modifying several GUCs, I can force parallel
plans. However, in no case can I get it to execute that uncorrelated
lateral subquery in multiple workers. That makes me suspect there's
another check in play here ensuring the lateral subquery is executed
for each group, and that in the uncorrelated case that rule still
holds -- it's just a single group.

> The docs say: "When a FROM item contains LATERAL cross-references, evaluation proceeds as follows: for each row of the FROM item providing the cross-referenced column(s), or set of rows of multiple FROM items providing the columns, the LATERAL item is evaluated using that row or row set's values of the columns. The resulting row(s) are joined as usual with the rows they were computed from. This is repeated for each row or set of rows from the column source table(s)."
>
> They don't say what happens with LATERAL when there aren't cross-references though. As we expect, adding one does show random() being called once for each source row.

If my theory above is correct then it's implicit that the row set is
the whole previous FROM group.

> test=# SELECT
> test-# foo.bar,
> test-# lat.bar
> test-# FROM
> test-# foo JOIN LATERAL (
> test(# SELECT
> test(# bar
> test(# FROM
> test(# foo AS foo2
> test(# WHERE
> test(# foo2.bar < foo.bar + 100000
> test(# ORDER BY
> test(# random()
> test(# LIMIT 1
> test(# ) AS lat ON true;
> bar | bar
> -----+-----
> 1 | 5
> 2 | 8
> 3 | 3
> 4 | 4
> 5 | 5
> 6 | 5
> 7 | 1
> 8 | 3
> 9 | 7
> 10 | 3
> (10 rows)
>
> It seems like to keep the same behavior that exists today, results of LATERAL subqueries would need to be the same if they aren't correlated, and so you couldn't run them in parallel with a limit if the order wasn't guaranteed. But I'll be the first to admit that it's easy enough for me to miss a key piece of logic on something like this, so I could be way off base too.

If it weren't for the volatile function in the example, I think I
could argue we could change the behavior (i.e., my theorizing above
about this just being an optimization)...but yes, it seems like this
behavior shouldn't change. I can't seem to make it break with the
patch, though.

While I haven't actually tracked down where this is handled to
guarantee it, a thought experiment -- I think -- shows it must be so.
Here's why: suppose we don't have a limit here, but the query's return
order differs between backends. Then we would have the same problem
you bring up, and in that case this code is already setting
consider_parallel=true on the rel. So I don't think we're changing any
behavior here.
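
To make the thought experiment concrete (an illustrative sketch, not a
test from the patch): even with the LIMIT removed, nothing pins down
the order in which the subquery's rows come back, and the existing
code already sets consider_parallel=true for a rel like this one:

    SELECT foo.bar, lat.bar
    FROM foo JOIN LATERAL (
        SELECT bar FROM foo AS foo2  -- no ORDER BY, no LIMIT
    ) AS lat ON true;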

James

1: https://www.postgresql.org/message-id/a50766a4-a927-41c4-984c-76e513b6d1c4%40www.fastmail.com
