Re: Parallel Inserts in CREATE TABLE AS

From: Luc Vlaming <luc(at)swarm64(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Zhihong Yu <zyu(at)yugabyte(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, "Hou, Zhijie" <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Inserts in CREATE TABLE AS
Date: 2021-01-05 07:13:07
Message-ID: ef02fe4e-b476-40e6-3fd6-3407caf82fb5@swarm64.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 04-01-2021 14:32, Bharath Rupireddy wrote:
> On Mon, Jan 4, 2021 at 4:22 PM Luc Vlaming <luc(at)swarm64(dot)com
> <mailto:luc(at)swarm64(dot)com>> wrote:
> > Sorry it took so long to get back to reviewing this.
>
> Thanks for the comments.
>
> > wrt v18-0001....patch:
> >
> > +               /*
> > +                * If the worker is for parallel insert in CTAS, then
> use the proper
> > +                * dest receiver.
> > +                */
> > +               intoclause = (IntoClause *) stringToNode(intoclausestr);
> > +               receiver = CreateIntoRelDestReceiver(intoclause);
> > +               ((DR_intorel *)receiver)->is_parallel_worker = true;
> > +               ((DR_intorel *)receiver)->object_id = fpes->objectid;
> > I would move this into a function called e.g.
> > GetCTASParallelWorkerReceiver so that the details wrt CTAS can be put in
> > createas.c.
> > I would then also split up intorel_startup into intorel_leader_startup
> > and intorel_worker_startup, and in GetCTASParallelWorkerReceiver set
> > self->pub.rStartup to intorel_worker_startup.
>
> My intention was to not add any new APIs to the dest receiver. I simply
> made the changes in intorel_startup, in which for workers it just does
> the minimalistic work and exit from it. In the leader most of the table
> creation and sanity check is kept untouched. Please have a look at the
> v19 patch posted upthread [1].
>

Looks much better, really nicely abstracted away in the v20 patch.

> > +       volatile pg_atomic_uint64       *processed;
> > why is it volatile?
>
> Intention is to always read from the actual memory location. I referred
> it from the way pg_atomic_fetch_add_u64_impl,
> pg_atomic_compare_exchange_u64_impl, pg_atomic_init_u64_impl and their
> u32 counterparts use pass the parameter as volatile pg_atomic_uint64 *ptr.
>

Okay I had not seen this syntax before for atomics with the volatile
keyword but its apparently how the atomics abstraction works in postgresql.

> > +                       if (isctas)
> > +                       {
> > +                               intoclause = ((DR_intorel *)
> node->dest)->into;
> > +                               objectid = ((DR_intorel *)
> node->dest)->object_id;
> > +                       }
> > Given that you extract them each once and then pass them directly into
> > the parallel-worker, can't you instead pass in the destreceiver and
> > leave that logic to ExecInitParallelPlan?
>
> That's changed entirely in the v19 patch set posted upthread [1]. Please
> have a look. I didn't pass the dest receiver, to keep the API generic, I
> passed parallel insert command type and a void * ptr which points to
> insertion command because the information we pass to workers depends on
> the insertion command (for instance, the information needed by workers
> is for CTAS into clause and object id and for Refresh Mat View object id).
>
> >
> > +                                       if
> (IS_PARALLEL_CTAS_DEST(gstate->dest) &&
> > +                                               ((DR_intorel *)
> gstate->dest)->into->rel &&
> > +                                               ((DR_intorel *)
> gstate->dest)->into->rel->relname)
> > why would rel and relname not be there? if no rows have been inserted?
> > because it seems from the intorel_startup function that that would be
> > set as soon as startup was done, which i assume (wrongly?) is always
> done?
>
> Actually, that into clause rel variable is always being set in the
> gram.y for CTAS, Create Materialized View and SELECT INTO (because
> qualified_name non-terminal is not optional). My bad. I just added it as
> a sanity check. Actually, it's not required.
>
> create_as_target:
> *qualified_name* opt_column_list table_access_method_clause
>             OptWith OnCommitOption OptTableSpace
>                 {
>                     $$ = makeNode(IntoClause);
> *                    $$->rel = $1;*
> create_mv_target:
> *qualified_name* opt_column_list table_access_method_clause
> opt_reloptions OptTableSpace
>             {
>                 $$ = makeNode(IntoClause);
> *                $$->rel = $1;*
> into_clause:
>             INTO OptTempTableName
>             {
>                 $$ = makeNode(IntoClause);
> *               $$->rel = $2;*
>
> I will change the below code:
> +                    if (GetParallelInsertCmdType(gstate->dest) ==
> +                        PARALLEL_INSERT_CMD_CREATE_TABLE_AS &&
> +                        ((DR_intorel *) gstate->dest)->into &&
> +                        ((DR_intorel *) gstate->dest)->into->rel &&
> +                        ((DR_intorel *) gstate->dest)->into->rel->relname)
> +                    {
>
> to:
> +                    if (GetParallelInsertCmdType(gstate->dest) ==
> +                        PARALLEL_INSERT_CMD_CREATE_TABLE_AS)
> +                    {
>
> I will update this in the next version of the patch set.
>

Thanks
> >
> > +        * In case if no workers were launched, allow the leader to
> insert entire
> > +        * tuples.
> > what does "entire tuples" mean? should it maybe be "all tuples"?
>
> Yeah, noticed that while working on the v19 patch set. Please have a
> look at the v19 patch posted upthread [1].
>
> > ================
> > wrt v18-0003....patch:
> >
> > not sure if it is needed, but i was wondering if we would want more
> > tests with multiple gather nodes existing? caused e.g. by using CTE's,
> > valid subquery's (like the one test you have, but without the group
> > by/having)?
>
> I'm not sure if we can have CTAS/CMV/SELECT INTO in CTEs like WITH, WITH
> RECURSIVE and I don't see that any of the WITH clause processing hits
> createas.c functions. So, IMHO, we don't need to add them. Please let me
> know if there are any specific use cases you have in mind.
>
> For instance, I tried to cover Init/Sub Plan and Subquery cases with:
>
> below case has multiple Gather, Init Plan:
> +-- parallel inserts must occur, as there is init plan that gets executed by
> +-- each parallel worker
> +select explain_pictas(
> +'create table parallel_write as select two col1,
> +    (select two from (select * from tenk2) as tt limit 1) col2
> +    from tenk1  where tenk1.four = 3;');
>
> below case has Gather, Sub Plan:
> +-- parallel inserts must not occur, as there is sub plan that gets
> executed by
> +-- the Gather node in leader
> +select explain_pictas(
> +'create table parallel_write as select two col1,
> +    (select tenk1.two from generate_series(1,1)) col2
> +    from tenk1  where tenk1.four = 3;');
>
> For multiple Gather node cases, I covered them with the Union All/Append
> cases in the 0004 patch. Please have a look.
>

Right, had not reviewed part 4 yet. My bad.

> [1] -
> https://www.postgresql.org/message-id/CALj2ACWth7mVQtqdYJwSn1mNmaHwxNE7YSYxRSLmfkqxRk%2Bzmg%40mail.gmail.com
> <https://www.postgresql.org/message-id/CALj2ACWth7mVQtqdYJwSn1mNmaHwxNE7YSYxRSLmfkqxRk%2Bzmg%40mail.gmail.com>
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com <http://www.enterprisedb.com>

Kind regards,
Luc

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2021-01-05 07:17:34 Re: pg_rewind restore_command issue in PG12
Previous Message Michael Paquier 2021-01-05 07:02:07 Re: Track replica origin progress for Rollback Prepared