RE: Parallel INSERT (INTO ... SELECT ...)

From: "Hou, Zhijie" <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)fujitsu(dot)com>, "Tang, Haiying" <tanghy(dot)fnst(at)cn(dot)fujitsu(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Amit Langote <amitlangote09(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>
Subject: RE: Parallel INSERT (INTO ... SELECT ...)
Date: 2021-02-10 09:59:13
Message-ID: c9d78965120340d3a45df299d5b46915@G08CNEXMBPEKD05.g08.fujitsu.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> > >
> > > else if (IsA(node, Query))
> > > {
> > > Query *query = (Query *) node;
> > >
> > > /* SELECT FOR UPDATE/SHARE must be treated as unsafe */
> > > if (query->rowMarks != NULL)
> > > {
> > > context->max_hazard = PROPARALLEL_UNSAFE;
> > > return true;
> > > }
> >
> > In my understanding, the max_parallel_hazard() query tree walk is to
> > find constructs that are parallel unsafe in that they call code that
> > can't run in parallel mode. For example, FOR UPDATE/SHARE on
> > traditional heap AM tuples calls AssignTransactionId() which doesn't
> > support being called in parallel mode. Likewise ON CONFLICT ... DO
> > UPDATE calls heap_update() which doesn't support parallelism. I'm not
> > aware of any such hazards downstream of ExecValuesScan().
> >
> > > >You're trying to
> > > > add something that bails based on second-guessing that a parallel
> > > >path would not be chosen, which I find somewhat objectionable.
> > > >
> > > > If the main goal of bailing out is to avoid doing the potentially
> > > > expensive modification safety check on the target relation, maybe
> > > > we should try to somehow make the check less expensive. I
> > > > remember reading somewhere in the thread about caching the result
> > > > of this check in relcache, but haven't closely studied the feasibility
> of doing so.
> > > >
> > >
> > > There's no "second-guessing" involved here.
> > > There is no underlying way of dividing up the VALUES data of
> > > "INSERT...VALUES" amongst the parallel workers, even if the planner
> > > was updated to produce a parallel-plan for the "INSERT...VALUES"
> > > case (apart from the fact that spawning off parallel workers to
> > > insert that data would almost always result in worse performance
> > > than a non-parallel plan...) The division of work for parallel
> > > workers is part of the table AM
> > > (scan) implementation, which is not invoked for "INSERT...VALUES".
> >
> > I don't disagree that the planner would not normally assign a parallel
> > path simply to pull values out of a VALUES list mentioned in the
> > INSERT command, but deciding something based on the certainty of it in
> > an earlier planning phase seems odd to me. Maybe that's just me
> > though.
> >
>
> I think it is more of a case where neither is a need for parallelism nor
> we want to support parallelism of it. The other possibility for such a check
> could be at some earlier phase say in standard_planner [1] where we are
> doing checks for other constructs where we don't want parallelism (I think
> the check for 'parse->hasModifyingCTE' is quite similar). If you see in
> that check as well we just assume other operations to be in the category
> of parallel-unsafe. I think we should rule out such cases earlier than later.
> Do you have better ideas than what Greg has done to avoid parallelism for
> such cases?
>
> [1] -
> standard_planner()
> {
> ..
> if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 && IsUnderPostmaster &&
> parse->commandType == CMD_SELECT &&
> !parse->hasModifyingCTE &&
> max_parallel_workers_per_gather > 0 &&
> !IsParallelWorker())
> {
> /* all the cheap tests pass, so scan the query tree */
> glob->maxParallelHazard = max_parallel_hazard(parse); parallelModeOK =
> glob->(glob->maxParallelHazard != PROPARALLEL_UNSAFE);
> }
> else
> {
> /* skip the query tree scan, just assume it's unsafe */
> glob->maxParallelHazard = PROPARALLEL_UNSAFE; parallelModeOK = false;
> }

+1.

In the current parallel_dml option patch. I put this check and some high-level check in a separate function called is_parallel_possible_for_modify.

- * PROPARALLEL_UNSAFE, PROPARALLEL_RESTRICTED, PROPARALLEL_SAFE
+ * Check at a high-level if parallel mode is able to be used for the specified
+ * table-modification statement.
+ * It's not possible in the following cases:
+ *
+ * 1) enable_parallel_dml is off
+ * 2) UPDATE or DELETE command
+ * 3) INSERT...ON CONFLICT...DO UPDATE
+ * 4) INSERT without SELECT on a relation
+ * 5) the reloption parallel_dml_enabled is not set for the target table
+ *
+ * (Note: we don't do in-depth parallel-safety checks here, we do only the
+ * cheaper tests that can quickly exclude obvious cases for which
+ * parallelism isn't supported, to avoid having to do further parallel-safety
+ * checks for these)
*/
+bool
+is_parallel_possible_for_modify(Query *parse)

And I put the function at earlier place like the following:

if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
IsUnderPostmaster &&
(parse->commandType == CMD_SELECT ||
- (enable_parallel_dml &&
- IsModifySupportedInParallelMode(parse->commandType))) &&
+ is_parallel_possible_for_modify(parse)) &&
!parse->hasModifyingCTE &&
max_parallel_workers_per_gather > 0 &&
!IsParallelWorker())

If this looks good, maybe we can merge this change.

Best regards,
houzj

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2021-02-10 10:03:42 Re: Parallel INSERT (INTO ... SELECT ...)
Previous Message Massimo Fidanza 2021-02-10 08:57:24 Re: [WIP] UNNEST(REFCURSOR): allowing SELECT to consume data from a REFCURSOR