Re: Parallel Inserts in CREATE TABLE AS

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, "Hou, Zhijie" <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Luc Vlaming <luc(at)swarm64(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Inserts in CREATE TABLE AS
Date: 2020-12-27 19:47:56
Message-ID: CALNJ-vRYzKbjTF3p++8tpVOK5Y5qroV3gYirCEdFnEsuE1g3tw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

For v16-0002-Tuple-Cost-Adjustment-for-Parallel-Inserts-in-CTAS.patch:

+ if (ignore &&
+ (root->parse->CTASParallelInsInfo &
+ CTAS_PARALLEL_INS_TUP_COST_CAN_IGN))

I wonder why CTAS_PARALLEL_INS_TUP_COST_CAN_IGN is checked again in the
above if since when ignore_parallel_tuple_cost returns
true, CTAS_PARALLEL_INS_TUP_COST_CAN_IGN is set already.

+ * In this function we only care Append and Gather nodes.

'care' -> 'care about'

+ for (int i = 0; i < aps->as_nplans; i++)
+ {
+ parallel |= PushDownCTASParallelInsertState(dest,
+ aps->appendplans[i],
+ gather_exists);

It seems the loop termination condition can include parallel since we can
come out of the loop once parallel is true.

+ if (!allow && tuple_cost_flags && gather_exists)

As the above code shows, gather_exists is only checked when allow is false.

+ * We set the flag for two cases when there is no parent path
will
+ * be created(such as : limit,sort,distinct...):

Please correct the grammar : there are two verbs following 'when'

For set_append_rel_size:

+ {
+ root->parse->CTASParallelInsInfo |=
+
CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND;
+ }
+ }
+
+ if (root->parse->CTASParallelInsInfo &
+ CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND)
+ {
+ root->parse->CTASParallelInsInfo &=
+
~CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND;

In the if block for childrel->rtekind ==
RTE_SUBQUERY, CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND maybe set. Why is it
cleared immediately after ?

+ /* Set to this in case tuple cost needs to be ignored for Append cases.
*/
+ CTAS_PARALLEL_INS_IGN_TUP_COST_APPEND = 1 << 3

Since each CTAS_PARALLEL_INS_ flag is a bit, maybe it's better to use 'turn
on' or similar term in the comment. Because 'set to' normally means
assignment.

Cheers

On Sun, Dec 27, 2020 at 12:50 AM Bharath Rupireddy <
bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:

> On Sat, Dec 26, 2020 at 11:11 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com>
> wrote:
> > I have reviewed part of v15-0001 patch, I have a few comments, I will
> > continue to review this.
>
> Thanks a lot.
>
> > 1.
> > Why is this temporary hack? and what is the plan for removing this hack?
>
> The changes in xact.c, xact.h and heapam.c are common to all the
> parallel insert patches - COPY, INSERT INTO SELECT. That was the
> initial comment, I forgot to keep it in sync with the other patches.
> Now, I used the comment from INSERT INTO SELECT patch. IIRC, the plan
> was to have these code in all the parallel inserts patch, whichever
> gets to review and commit first, others will update their patches
> accordingly.
>
> > 2.
> > +/*
> > + * ChooseParallelInsertsInCTAS --- determine whether or not parallel
> > + * insertion is possible, if yes set the parallel insert state i.e.
> push down
> > + * the dest receiver to the Gather nodes.
> > + */
> > +void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc)
> > +{
> > + if (!IS_CTAS(into))
> > + return;
> >
> > When will this hit? The functtion name suggest that it is from CTAS
> > but now you have a check that if it is
> > not for CTAS then return, can you add the comment that when do you
> > expect this case?
>
> Yes it will hit for explain cases, but I choose to remove this and
> check outside in the explain something like:
> if (into)
> ChooseParallelInsertsInCTAS()
>
> > Also the function name should start in a new line
> > i.e
> > void
> > ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc)
>
> Ah, missed that. Modified now.
>
> > 3.
> > +/*
> > + * ChooseParallelInsertsInCTAS --- determine whether or not parallel
> > + * insertion is possible, if yes set the parallel insert state i.e.
> push down
> > + * the dest receiver to the Gather nodes.
> > + */
> >
> > Push down to the Gather nodes? I think the right statement will be
> > push down below the Gather node.
>
> Modified.
>
> > 4.
> > intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
> > {
> > DR_intorel *myState = (DR_intorel *) self;
> >
> > -- Comment ->in parallel worker we don't need to crease dest recv
> blah blah
> > + if (myState->is_parallel_worker)
> > {
> > --parallel worker handling--
> > return;
> > }
> >
> > --non-parallel worker code stay right there, instead of moving to
> else
>
> Done.
>
> > 5.
> > +/*
> > + * ChooseParallelInsertsInCTAS --- determine whether or not parallel
> > + * insertion is possible, if yes set the parallel insert state i.e.
> push down
> > + * the dest receiver to the Gather nodes.
> > + */
> > +void ChooseParallelInsertsInCTAS(IntoClause *into, QueryDesc *queryDesc)
> > +{
> >
> > From function name and comments it appeared that this function will
> > return boolean saying whether
> > Parallel insert should be selected or not. I think name/comment
> > should be better for this
>
> Yeah that function can still return void because no point in returning
> bool there, since the intention is to see if parallel inserts can be
> performed, if yes, set the state otherwise exit. I changed the
> function name to TryParallelizingInsertsInCTAS(). Let me know your
> suggestions if that doesn't work out.
>
> > 6.
> > /*
> > + * For parallelizing inserts in CTAS i.e. making each parallel
> worker
> > + * insert the tuples, we must send information such as into
> clause (for
> > + * each worker to build separate dest receiver), object id (for
> each
> > + * worker to open the created table).
> >
> > Comment is saying we need to pass object id but the code under this
> > comment is not doing so.
>
> Improved the comment.
>
> > 7.
> > + /*
> > + * Since there are no rows that are transferred from workers to
> Gather
> > + * node, so we set it to 0 to be visible in estimated row count
> of
> > + * explain plans.
> > + */
> > + queryDesc->planstate->plan->plan_rows = 0;
> >
> > This seems a bit hackies Why it is done after the planning, I mean
> > plan must know that it is returning a 0 rows?
>
> This exists to show up the estimated row count(in case of EXPLAIN CTAS
> without ANALYZE) in the output. For EXPLAIN ANALYZE CTAS actual tuples
> are shown correctly as 0 because Gather doesn't receive any tuples.
> if (es->costs)
> {
> if (es->format == EXPLAIN_FORMAT_TEXT)
> {
> appendStringInfo(es->str, " (cost=%.2f..%.2f rows=%.0f
> width=%d)",
> plan->startup_cost, plan->total_cost,
> plan->plan_rows, plan->plan_width);
>
> Since it's an estimated row count(which may not be always correct), we
> will let the EXPLAIN plan show that and I think we can remove that
> part. Thoughts?
>
> I removed it in v6 patch set.
>
> > 8.
> > + char *intoclause_space = shm_toc_allocate(pcxt->toc,
> > + intoclause_len);
> > + memcpy(intoclause_space, intoclausestr, intoclause_len);
> > + shm_toc_insert(pcxt->toc, PARALLEL_KEY_INTO_CLAUSE,
> intoclause_space);
> >
> > One blank line between variable declaration and next code segment,
> > take care at other places as well.
>
> Done.
>
> I'm attaching the v16 patch set. Please note that I added the
> documentation saying that parallel insertions can happen and a sample
> output of the explain to 0003 patch as discussed in [1]. But I didn't
> move the explain output related code to a separate patch because it's
> a small snippet in explain.c. I hope that's okay.
>
> [1] -
> https://www.postgresql.org/message-id/CAA4eK1JqwXGYoGa1%2B3-f0T50dBGufvKaKQOee_AfFhygZ6QKtA%40mail.gmail.com
>
>
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joel Jacobson 2020-12-27 20:02:30 hash_extension(text)
Previous Message Alexander Korotkov 2020-12-27 19:38:38 Re: range_agg