Re: initial pruning in parallel append

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: initial pruning in parallel append
Date: 2023-08-08 06:58:02
Message-ID: CA+HiwqGoTF_Vd37JJgkJCqufv7SZ+b=0kbw3Ee=Dj7BH_e2mPQ@mail.gmail.com

On Tue, Aug 8, 2023 at 12:53 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Aug 7, 2023 at 10:25 AM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> > Note we’re talking here about “initial” pruning that occurs during ExecInitNode(). Workers are only launched during ExecGather[Merge](), which thereafter does ExecInitNode() on each worker's copy of the plan tree. So if we are to pass the pruning results for cross-checking, it will have to be from the leader to the workers.
>
> That doesn't seem like a big problem because there aren't many node
> types that do pruning, right? I think we're just talking about Append
> and MergeAppend, or something like that, right?

MergeAppend can't be parallel-aware atm, so only Append.

> You just need the
> ExecWhateverEstimate function to budget some DSM space to store the
> information, which can basically just be a bitmap over the set of
> child plans, and the ExecWhateverInitializeDSM copies the information
> into that DSM space, and ExecWhateverInitializeWorker() copies the
> information from the shared space back into the local node (or maybe
> just points to it, if the representation is sufficiently compatible).
> I feel like this is an hour or two's worth of coding, unless I'm
> missing something, and WAY easier than trying to reason about what
> happens if expression evaluation isn't as stable as we'd like it to
> be.

OK, I agree that we'd better share the pruning result between the
leader and workers.
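
IIUC, that would look roughly like the following in nodeAppend.c -- a
sketch only, with the extra toc entry (SOME_NEW_KEY) and the function
names made up for illustration:

/* Leader, at estimate time: reserve one flag per Append child. */
static void
ExecAppendEstimatePruneResult(AppendState *node, ParallelContext *pcxt)
{
    shm_toc_estimate_chunk(&pcxt->estimator,
                           mul_size(sizeof(bool), node->as_nplans));
    shm_toc_estimate_keys(&pcxt->estimator, 1);
}

/* Leader: copy as_valid_subplans, computed in ExecInitAppend(), into DSM. */
static void
ExecAppendInitializeDSMPruneResult(AppendState *node, ParallelContext *pcxt)
{
    bool       *validflags;

    validflags = shm_toc_allocate(pcxt->toc,
                                  mul_size(sizeof(bool), node->as_nplans));
    for (int i = 0; i < node->as_nplans; i++)
        validflags[i] = bms_is_member(i, node->as_valid_subplans);

    /* key choice hand-waved: plan_node_id already keys the existing
     * ParallelAppendState chunk */
    shm_toc_insert(pcxt->toc, SOME_NEW_KEY(node), validflags);
}

/* Worker: rebuild as_valid_subplans instead of redoing initial pruning. */
static void
ExecAppendInitializeWorkerPruneResult(AppendState *node,
                                      ParallelWorkerContext *pwcxt)
{
    bool       *validflags;

    validflags = shm_toc_lookup(pwcxt->toc, SOME_NEW_KEY(node), false);
    node->as_valid_subplans = NULL;
    for (int i = 0; i < node->as_nplans; i++)
    {
        if (validflags[i])
            node->as_valid_subplans =
                bms_add_member(node->as_valid_subplans, i);
    }
}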

I hadn't thought about putting the pruning result into Append's DSM
(ParallelAppendState), which is what you're describing IIUC. I looked
into it, but I'm not sure it can be made to work the way things
currently stand on the worker side, at least not without some
reshuffling of code in ParallelQueryMain(). The pruning result would
have to be available in ExecInitAppend(), but in a worker the per-node
DSM contents are only consulted (by ExecParallelInitializeWorker())
after the plan tree has already been initialized, so it won't be
available in time. Perhaps we could integrate
ExecParallelInitializeWorker()'s responsibilities into ExecutorStart()
/ ExecInitNode() somehow?

That would mean changing the ordering of the following code in ParallelQueryMain():

/* Start up the executor */
queryDesc->plannedstmt->jitFlags = fpes->jit_flags;
ExecutorStart(queryDesc, fpes->eflags);

/* Special executor initialization steps for parallel workers */
queryDesc->planstate->state->es_query_dsa = area;
if (DsaPointerIsValid(fpes->param_exec))
{
    char       *paramexec_space;

    paramexec_space = dsa_get_address(area, fpes->param_exec);
    RestoreParamExecParams(paramexec_space, queryDesc->estate);
}
pwcxt.toc = toc;
pwcxt.seg = seg;
ExecParallelInitializeWorker(queryDesc->planstate, &pwcxt);

Looking inside ExecParallelInitializeWorker():

static bool
ExecParallelInitializeWorker(PlanState *planstate, ParallelWorkerContext *pwcxt)
{
    if (planstate == NULL)
        return false;

    switch (nodeTag(planstate))
    {
        case T_SeqScanState:
            if (planstate->plan->parallel_aware)
                ExecSeqScanInitializeWorker((SeqScanState *) planstate, pwcxt);

I guess that would mean moving the if (planstate->plan->parallel_aware)
block seen here to the end of ExecInitSeqScan(), and doing the same for
the other parallel-aware node types; see the sketch below.
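
Very roughly, something like this -- only a sketch: the
es_parallel_worker_context field is made up, and ParallelQueryMain()
would have to hand the ParallelWorkerContext to ExecutorStart() somehow
(e.g. via the QueryDesc) so that it ends up in the EState before the
plan tree gets initialized:

SeqScanState *
ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
{
    SeqScanState *scanstate;

    /* ... all of the existing initialization that builds scanstate ... */

    /*
     * In a parallel worker, attach to this node's DSM chunk right here,
     * rather than in a separate ExecParallelInitializeWorker() pass that
     * runs only after the whole plan tree has been initialized.
     */
    if (node->scan.plan.parallel_aware &&
        estate->es_parallel_worker_context != NULL) /* made-up field */
        ExecSeqScanInitializeWorker(scanstate,
                                    estate->es_parallel_worker_context);

    return scanstate;
}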

Or we could consider something like the patch I mentioned in my 1st
email. The idea there was to pass the pruning result via a separate
channel, not the DSM chunk linked into the PlanState tree. To wit, on
the leader side, ExecInitParallelPlan() puts the serialized
List-of-Bitmapset into the shm_toc with a dedicated PARALLEL_KEY,
alongside PlannedStmt, ParamListInfo, etc. The List-of-Bitmapset is
built during the leader's ExecInitNode().

On the worker side, ExecParallelGetQueryDesc() reads the
List-of-Bitmapset string and puts the resulting node into the
QueryDesc, which ParallelQueryMain() then uses to do ExecutorStart(),
which copies the pointer into EState.es_part_prune_results.
ExecInitAppend() consults EState.es_part_prune_results and uses the
Bitmapset from there, if present, instead of performing initial
pruning itself. I'm assuming it's not too ugly for ExecInitAppend() to
use IsParallelWorker() to decide whether it should be writing to
EState.es_part_prune_results (in the leader) or reading from it (in a
worker).

If we are to go with this approach, we will need to un-revert
ec386948948c, which moved the PartitionPruneInfo nodes out of
Append/MergeAppend and into a List in PlannedStmt (copied into
EState.es_part_prune_infos), so that es_part_prune_results mirrors
es_part_prune_infos.
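
In ExecInitAppend(), the relevant bit would then look something like
the following -- again only a sketch, assuming the un-reverted
ec386948948c shape where Append carries a part_prune_index into
es_part_prune_infos, and glossing over keeping es_part_prune_results
ordered to match es_part_prune_infos and over how exec-time pruning
state gets set up in workers:

    if (node->part_prune_index >= 0)
    {
        if (!IsParallelWorker())
        {
            PartitionPruneInfo *pruneinfo;

            /* Leader: perform initial pruning as now ... */
            pruneinfo = list_nth_node(PartitionPruneInfo,
                                      estate->es_part_prune_infos,
                                      node->part_prune_index);
            prunestate = ExecInitPartitionPruning(&appendstate->ps,
                                                  list_length(node->appendplans),
                                                  pruneinfo,
                                                  &validsubplans);

            /* ... and stash the result for ExecInitParallelPlan() to ship. */
            estate->es_part_prune_results =
                lappend(estate->es_part_prune_results, validsubplans);
        }
        else
        {
            /* Worker: reuse the leader's result instead of re-pruning. */
            validsubplans = list_nth_node(Bitmapset,
                                          estate->es_part_prune_results,
                                          node->part_prune_index);
        }
        nplans = bms_num_members(validsubplans);
    }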

Thoughts?

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
