Re: Parallel INSERT (INTO ... SELECT ...)

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Greg Nancarrow <gregn4422(at)gmail(dot)com>, "Hou, Zhijie" <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)fujitsu(dot)com>, "Tang, Haiying" <tanghy(dot)fnst(at)cn(dot)fujitsu(dot)com>
Subject: Re: Parallel INSERT (INTO ... SELECT ...)
Date: 2021-02-10 08:50:59
Message-ID: CA+HiwqH3eCaCgTgvhtBH03f8jAjnvKpmHp25Y4oQejGNYHuhLg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 10, 2021 at 5:24 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Wed, Feb 10, 2021 at 1:00 PM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> > On Wed, Feb 10, 2021 at 1:35 PM Greg Nancarrow <gregn4422(at)gmail(dot)com> wrote:
> > > It's parallel UNSAFE because it's not safe, or even possible, to have
> > > a parallel plan for this.
> > > (Since UNSAFE is the maximum hazard level, there is no point in
> > > checking context->max_interesting.)
> > > And there are existing cases of bailing out without doing further
> > > safety checks (indeed, your v15_delta.diff patch placed its code for
> > > bailing out on "ON CONFLICT ... DO UPDATE" beneath one such existing
> > > case in max_parallel_hazard_walker()):
> > >
> > > else if (IsA(node, Query))
> > > {
> > >     Query      *query = (Query *) node;
> > >
> > >     /* SELECT FOR UPDATE/SHARE must be treated as unsafe */
> > >     if (query->rowMarks != NULL)
> > >     {
> > >         context->max_hazard = PROPARALLEL_UNSAFE;
> > >         return true;
> > >     }
> >
> > In my understanding, the max_parallel_hazard() query tree walk is to
> > find constructs that are parallel unsafe in that they call code that
> > can't run in parallel mode. For example, FOR UPDATE/SHARE on
> > traditional heap AM tuples calls AssignTransactionId() which doesn't
> > support being called in parallel mode. Likewise ON CONFLICT ... DO
> > UPDATE calls heap_update() which doesn't support parallelism. I'm not
> > aware of any such hazards downstream of ExecValuesScan().
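[The shape of the walk being described can be modeled outside the server. Below is a standalone toy sketch, not PostgreSQL source; the struct and flag names are invented for illustration. It shows the essential behavior: the walker tracks the worst hazard seen and bails out as soon as an UNSAFE construct (one that reaches code unable to run in parallel mode) is found, since no higher hazard level exists.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy hazard levels mirroring PROPARALLEL_SAFE/RESTRICTED/UNSAFE. */
typedef enum { HAZARD_SAFE = 0, HAZARD_RESTRICTED = 1, HAZARD_UNSAFE = 2 } Hazard;

/* Invented stand-in for a query-tree node carrying the constructs
 * discussed above; a real Query node has far more fields. */
typedef struct ToyQuery
{
    bool        has_row_marks;          /* SELECT FOR UPDATE/SHARE present */
    bool        has_on_conflict_update; /* ON CONFLICT ... DO UPDATE present */
    struct ToyQuery *subquery;          /* nested query, if any */
} ToyQuery;

/* Walk the (linearized) tree, keeping the worst hazard seen; return
 * immediately on UNSAFE, as there is nothing worse to find. */
static Hazard
max_hazard_walk(const ToyQuery *q)
{
    Hazard      h = HAZARD_SAFE;

    for (; q != NULL; q = q->subquery)
    {
        /* Both constructs reach code (AssignTransactionId, heap_update)
         * that cannot run in parallel mode, hence UNSAFE. */
        if (q->has_row_marks || q->has_on_conflict_update)
            return HAZARD_UNSAFE;
    }
    return h;
}
```

A plain query comes back SAFE, while one containing ON CONFLICT ... DO UPDATE anywhere in the chain comes back UNSAFE without visiting deeper nodes.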
> >
> > > > You're trying to
> > > > add something that bails based on second-guessing that a parallel path
> > > > would not be chosen, which I find somewhat objectionable.
> > > >
> > > > If the main goal of bailing out is to avoid doing the potentially
> > > > expensive modification safety check on the target relation, maybe we
> > > > should try to somehow make the check less expensive. I remember
> > > > reading somewhere in the thread about caching the result of this check
> > > > in relcache, but haven't closely studied the feasibility of doing so.
> > > >
> > >
> > > There's no "second-guessing" involved here.
> > > There is no underlying way of dividing up the VALUES data of
> > > "INSERT...VALUES" amongst the parallel workers, even if the planner
> > > were updated to produce a parallel plan for the "INSERT...VALUES" case
> > > (apart from the fact that spawning parallel workers to insert that
> > > data would almost always result in worse performance than a
> > > non-parallel plan...).
> > > The division of work among parallel workers is part of the table AM
> > > (scan) implementation, which is not invoked for "INSERT...VALUES".
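[Greg's point about work division can be made concrete with a toy model. In a parallel table scan, workers carve up the relation by claiming block numbers from shared scan state (the real mechanism lives in the table AM's scan code and uses atomics); a VALUES list offers no analogous unit of division. A standalone sketch, with all names invented:]

```c
#include <assert.h>

/* Toy shared scan state: a block counter from which each worker claims
 * the next unscanned block, modeling how a parallel table scan divides
 * work.  (The real counter is updated atomically in shared memory.) */
typedef struct ToyParallelScan
{
    int         nblocks;        /* total blocks in the table */
    int         next_block;     /* next unclaimed block number */
} ToyParallelScan;

/* Each worker calls this repeatedly; -1 means the scan is exhausted. */
static int
claim_next_block(ToyParallelScan *scan)
{
    if (scan->next_block >= scan->nblocks)
        return -1;
    return scan->next_block++;
}
```

Because this division happens at the level of table blocks handed out by the scan machinery, an inline VALUES list, which never goes through a table AM scan, has nothing for workers to claim.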
> >
> > I don't disagree that the planner would not normally assign a parallel
> > path simply to pull values out of a VALUES list mentioned in the
> > INSERT command, but baking that certainty into an earlier planning
> > phase seems odd to me. Maybe that's just me though.
> >
>
> I think it is more a case where there is neither a need for
> parallelism nor any desire to support it. The other possibility for
> such a check could be at some earlier phase, say in standard_planner
> [1], where we do checks for other constructs for which we don't want
> parallelism (I think the check for 'parse->hasModifyingCTE' is quite
> similar). If you look at that check, we likewise just assume other
> operations to be parallel-unsafe. I think we should rule out such
> cases earlier rather than later. Do you have better ideas than what
> Greg has done to avoid parallelism for such cases?
>
> [1] -
> standard_planner()
> {
> ..
>     if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
>         IsUnderPostmaster &&
>         parse->commandType == CMD_SELECT &&
>         !parse->hasModifyingCTE &&
>         max_parallel_workers_per_gather > 0 &&
>         !IsParallelWorker())
>     {
>         /* all the cheap tests pass, so scan the query tree */
>         glob->maxParallelHazard = max_parallel_hazard(parse);
>         glob->parallelModeOK = (glob->maxParallelHazard != PROPARALLEL_UNSAFE);
>     }
>     else
>     {
>         /* skip the query tree scan, just assume it's unsafe */
>         glob->maxParallelHazard = PROPARALLEL_UNSAFE;
>         glob->parallelModeOK = false;
>     }

Yeah, maybe having the block I was commenting on, viz.:

+    /*
+     * If there is no underlying SELECT, a parallel table-modification
+     * operation is not possible (nor desirable).
+     */
+    hasSubQuery = false;
+    foreach(lc, parse->rtable)
+    {
+        rte = lfirst_node(RangeTblEntry, lc);
+        if (rte->rtekind == RTE_SUBQUERY)
+        {
+            hasSubQuery = true;
+            break;
+        }
+    }
+    if (!hasSubQuery)
+        return PROPARALLEL_UNSAFE;

before the standard_planner() block you quoted might be a good idea.
So something like this:

+    /*
+     * If there is no underlying SELECT, a parallel table-modification
+     * operation is not possible (nor desirable).
+     */
+    rangeTableHasSubQuery = false;
+    foreach(lc, parse->rtable)
+    {
+        rte = lfirst_node(RangeTblEntry, lc);
+        if (rte->rtekind == RTE_SUBQUERY)
+        {
+            rangeTableHasSubQuery = true;
+            break;
+        }
+    }

    if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
        IsUnderPostmaster &&
        (parse->commandType == CMD_SELECT ||
         (IsModifySupportedInParallelMode(parse->commandType) &&
          rangeTableHasSubQuery)) &&
        !parse->hasModifyingCTE &&
        max_parallel_workers_per_gather > 0 &&
        !IsParallelWorker())
    {
        /* all the cheap tests pass, so scan the query tree */
        glob->maxParallelHazard = max_parallel_hazard(parse);
        glob->parallelModeOK = (glob->maxParallelHazard != PROPARALLEL_UNSAFE);
    }
    else
    {
        /* skip the query tree scan, just assume it's unsafe */
        glob->maxParallelHazard = PROPARALLEL_UNSAFE;
        glob->parallelModeOK = false;
    }
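[To make the intent of that condition concrete, here is a standalone toy model, with invented types and names rather than server code, of the cheap tests: a SELECT always proceeds to the query-tree walk, while a modify command supported in parallel mode does so only when its range table contains a subquery RTE, i.e. an underlying SELECT.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented stand-ins for the command type and range-table entry kinds. */
typedef enum { CMD_SELECT_T, CMD_INSERT_T } ToyCmd;
typedef enum { RTE_RELATION_T, RTE_SUBQUERY_T, RTE_VALUES_T } ToyRteKind;

typedef struct ToyParse
{
    ToyCmd      commandType;
    const ToyRteKind *rtable;   /* kinds of the range-table entries */
    size_t      rtable_len;
} ToyParse;

/* Scan the range table for a subquery RTE, as in the proposed
 * rangeTableHasSubQuery loop. */
static bool
range_table_has_subquery(const ToyParse *p)
{
    for (size_t i = 0; i < p->rtable_len; i++)
        if (p->rtable[i] == RTE_SUBQUERY_T)
            return true;
    return false;
}

/* Model of the extended cheap test: SELECT always qualifies for the
 * tree walk; INSERT qualifies only with an underlying SELECT. */
static bool
cheap_tests_pass(const ToyParse *p)
{
    return p->commandType == CMD_SELECT_T ||
           (p->commandType == CMD_INSERT_T && range_table_has_subquery(p));
}
```

Under this model an "INSERT ... VALUES" (relation plus VALUES RTE only) skips the walk and is assumed unsafe, while "INSERT ... SELECT" (relation plus subquery RTE) goes on to the full max_parallel_hazard() scan.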

--
Amit Langote
EDB: http://www.enterprisedb.com
