Re: Parallelize correlated subqueries that execute within each worker

From: James Coleman <jtc331(at)gmail(dot)com>
To: Antonin Houska <ah(at)cybertec(dot)at>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Parallelize correlated subqueries that execute within each worker
Date: 2023-02-08 13:38:27
Message-ID: CAAaqYe8fF=qc2Z+ttmrBkCMhV976Sy83uS1w9CyXMV7=X7NSMg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 6, 2023 at 11:39 AM Antonin Houska <ah(at)cybertec(dot)at> wrote:
>
> James Coleman <jtc331(at)gmail(dot)com> wrote:
> > Which this patch we do in fact now see (as expected) rels with
> > non-empty lateral_relids showing up in generate_[useful_]gather_paths.
> > And the partial paths can now have non-empty required outer rels. I'm
> > not able to come up with a plan that would actually be caught by those
> > checks; I theorize that because of the few places we actually call
> > generate_[useful_]gather_paths we are in practice already excluding
> > those, but for now I've left these as a conditional rather than an
> > assertion because it seems like the kind of guard we'd want to ensure
> > those methods are safe.
>
> Maybe we can later (in separate patches) relax the restrictions imposed on
> partial path creation a little bit, so that more parameterized partial paths
> are created.
>
> One particular case that should be rejected by your checks is a partial index
> path, which can be parameterized, but I couldn't construct a query that makes
> your checks fire. Maybe the reason is that a parameterized index path is
> mostly used on the inner side of a NL join, however no partial path can be
> used there. (The problem is that each worker evaluating the NL join would only
> see a subset of the inner relation, which whould lead to incorrect results.)
>
> So I'd also choose conditions rather than assert statements.

Thanks for confirming.

>
> Following are my (minor) findings:
>
> In generate_gather_paths() you added this test
>
> /*
> * Delay gather path creation until the level in the join tree where all
> * params used in a worker are generated within that worker.
> */
> if (!bms_is_subset(required_outer, rel->relids))
> return;
>
> but I'm not sure if required_outer can contain anything of rel->relids. How
> about using bms_is_empty(required) outer, or even this?
>
> if (required_outer)
> return;
>
> Similarly,
>
> /* We can't pass params to workers. */
> if (!bms_is_subset(PATH_REQ_OUTER(cheapest_partial_path), rel->relids))
>
> might look like
>
> if (!bms_is_empty(PATH_REQ_OUTER(cheapest_partial_path)))
>
> or
>
> if (PATH_REQ_OUTER(cheapest_partial_path))

I'm not sure about this change. Deciding is difficult given the fact
that we don't seem to currently generate these paths, but I don't see
a reason why lateral_relids can't be present on the rel, and if so,
then we need to check to see if they're a subset of relids we can
satisfy rather than checking that they don't exist.

> In particular, build_index_paths() does the following when setting
> outer_relids (which eventually becomes (path->param_info->ppi_req_outer):
>
> /* Enforce convention that outer_relids is exactly NULL if empty */
> if (bms_is_empty(outer_relids))
> outer_relids = NULL;
>
>
> Another question is whether in this call
>
> simple_gather_path = (Path *)
> create_gather_path(root, rel, cheapest_partial_path, rel->reltarget,
> required_outer, rowsp);
>
> required_outer should be passed to create_gather_path(). Shouldn't it rather
> be PATH_REQ_OUTER(cheapest_partial_path) that you test just above? Again,
> build_index_paths() initializes outer_relids this way
>
> outer_relids = bms_copy(rel->lateral_relids);
>
> but then it may add some more relations:
>
> /* OK to include this clause */
> index_clauses = lappend(index_clauses, iclause);
> outer_relids = bms_add_members(outer_relids,
> rinfo->clause_relids);
>
> So I think that PATH_REQ_OUTER(cheapest_partial_path) in
> generate_gather_paths() can eventually contain more relations than
> required_outer, and therefore it's safer to check the first.

Yes, this is a good catch. Originally I didn't know about
PATH_REQ_OUTER, and I'd missed using it in these places.

>
> Similar comments might apply to generate_useful_gather_paths(). Here I also
> suggest to move this test
>
> /* We can't pass params to workers. */
> if (!bms_is_subset(PATH_REQ_OUTER(subpath), rel->relids))
> continue;
>
> to the top of the loop because it's relatively cheap.

Moved.

Attached is v9.

James Coleman

Attachment Content-Type Size
v9-0001-Add-tests-before-change.patch application/octet-stream 7.2 KB
v9-0002-Parallelize-correlated-subqueries.patch application/octet-stream 30.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message shveta malik 2023-02-08 13:48:51 Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Previous Message Andrew Dunstan 2023-02-08 13:27:07 Re: run pgindent on a regular basis / scripted manner