Re: Parallel Inserts in CREATE TABLE AS

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, "Hou, Zhijie" <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Luc Vlaming <luc(at)swarm64(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Zhihong Yu <zyu(at)yugabyte(dot)com>
Subject: Re: Parallel Inserts in CREATE TABLE AS
Date: 2020-12-26 05:40:57
Message-ID: CAFiTN-tgTyaLXiJVvhBMb3EoALZhDxBBdr6XY844bvVTCgaWQg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Dec 24, 2020 at 1:07 PM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> On Thu, Dec 24, 2020 at 10:25 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > You could change intoclause_len = strlen(intoclausestr) to
> > strlen(intoclausestr) + 1 and use intoclause_len in the remaining
> > places. We can avoid the +1 in the other places.
> > + /* Estimate space for into clause for CTAS. */
> > + if (IS_CTAS(intoclause) && OidIsValid(objectid))
> > + {
> > + intoclausestr = nodeToString(intoclause);
> > + intoclause_len = strlen(intoclausestr);
> > + shm_toc_estimate_chunk(&pcxt->estimator, intoclause_len + 1);
> > + shm_toc_estimate_keys(&pcxt->estimator, 1);
> > + }
>
> Done.
>
> > Can we use node->nworkers_launched == 0 in place of
> > node->need_to_scan_locally, that way the setting and resetting of
> > node->need_to_scan_locally can be removed. Unless need_to_scan_locally
> > is needed in any of the functions that gets called.
> > + /* Enable leader to insert in case no parallel workers were launched. */
> > + if (node->nworkers_launched == 0)
> > + node->need_to_scan_locally = true;
> > +
> > + /*
> > + * By now, for parallel workers (if launched any), would have
> > started their
> > + * work i.e. insertion to target table. In case the leader is chosen to
> > + * participate for parallel inserts in CTAS, then finish its
> > share before
> > + * going to wait for the parallel workers to finish.
> > + */
> > + if (node->need_to_scan_locally)
> > + {
>
> need_to_scan_locally is being set in ExecGather() even if
> nworkers_launched > 0 it can still be true, so I think we can not
> remove need_to_scan_locally in ExecParallelInsertInCTAS.
>
> Attaching v15 patch set for further review. Note that the change is
> only in 0001 patch, other patches remain unchanged from v14.

I have reviewed part of v15-0001 patch, I have a few comments, I will
continue to review this.

1.

@@ -763,18 +763,34 @@ GetCurrentCommandId(bool used)
/* this is global to a transaction, not subtransaction-local */
if (used)
{
- /*
- * Forbid setting currentCommandIdUsed in a parallel worker, because
- * we have no provision for communicating this back to the leader. We
- * could relax this restriction when currentCommandIdUsed was already
- * true at the start of the parallel operation.
- */
- Assert(!IsParallelWorker());
+ /*
+ * This is a temporary hack for all common parallel insert cases i.e.
+ * insert into, ctas, copy from. To be changed later. In a parallel
+ * worker, set currentCommandIdUsed to true only if it was not set to
+ * true at the start of the parallel operation (by way of
+ * SetCurrentCommandIdUsedForWorker()). We have to do this because
+ * GetCurrentCommandId(true) may be called from anywhere, especially
+ * for parallel inserts, within parallel worker.
+ */
+ Assert(!(IsParallelWorker() && !currentCommandIdUsed));

Why is this temporary hack? and what is the plan for removing this hack?

2.
+/*
+ * ChooseParallelInsertsInCTAS --- determine whether or not parallel
+ * insertion is possible, if yes set the parallel insert state i.e. push down
+ * the dest receiver to the Gather nodes.
+ */
+void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc)
+{
+ if (!IS_CTAS(into))
+ return;

When will this hit? The functtion name suggest that it is from CTAS
but now you have a check that if it is
not for CTAS then return, can you add the comment that when do you
expect this case?

Also the function name should start in a new line
i.e
void
ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc)

3.
+/*
+ * ChooseParallelInsertsInCTAS --- determine whether or not parallel
+ * insertion is possible, if yes set the parallel insert state i.e. push down
+ * the dest receiver to the Gather nodes.
+ */

Push down to the Gather nodes? I think the right statement will be
push down below the Gather node.

4.
intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
{
DR_intorel *myState = (DR_intorel *) self;

+ if (myState->is_parallel_worker)
+ {
+ /* In the worker */

+ SetCurrentCommandIdUsedForWorker();
+ myState->output_cid = GetCurrentCommandId(false);
+ }
+ else
{
non-parallel worker code
}
}

I think instead of moving all the code related to non-parallel worker
in the else we can do better. This
will avoid unnecessary code movement.

4.
intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
{
DR_intorel *myState = (DR_intorel *) self;

-- Comment ->in parallel worker we don't need to crease dest recv blah blah
+ if (myState->is_parallel_worker)
{
--parallel worker handling--
return;
}

--non-parallel worker code stay right there, instead of moving to else

5.
+/*
+ * ChooseParallelInsertsInCTAS --- determine whether or not parallel
+ * insertion is possible, if yes set the parallel insert state i.e. push down
+ * the dest receiver to the Gather nodes.
+ */
+void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc)
+{

From function name and comments it appeared that this function will
return boolean saying whether
Parallel insert should be selected or not. I think name/comment
should be better for this

6.
/*
+ * For parallelizing inserts in CTAS i.e. making each parallel worker
+ * insert the tuples, we must send information such as into clause (for
+ * each worker to build separate dest receiver), object id (for each
+ * worker to open the created table).

Comment is saying we need to pass object id but the code under this
comment is not doing so.

7.
+ /*
+ * Since there are no rows that are transferred from workers to Gather
+ * node, so we set it to 0 to be visible in estimated row count of
+ * explain plans.
+ */
+ queryDesc->planstate->plan->plan_rows = 0;

This seems a bit hackies Why it is done after the planning, I mean
plan must know that it is returning a 0 rows?

8.
+ char *intoclause_space = shm_toc_allocate(pcxt->toc,
+ intoclause_len);
+ memcpy(intoclause_space, intoclausestr, intoclause_len);
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INTO_CLAUSE, intoclause_space);

One blank line between variable declaration and next code segment,
take care at other places as well.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Erik Rijkers 2020-12-26 06:18:21 Re: proposal: schema variables
Previous Message Pavel Stehule 2020-12-26 05:18:01 Re: pgsql: Add pg_alterckey utility to change the cluster key