Re: Parallel Inserts in CREATE TABLE AS

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: Re: Parallel Inserts in CREATE TABLE AS
Date: 2020-09-28 10:28:03
Message-ID: CALj2ACWj+3H5TQqwxANZmdePEnSNxk-YAeT1c5WE184Gf75XUw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks Andres for the comments.

On Thu, Sep 24, 2020 at 8:11 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> > The design:
>
> I think it'd be good if you could explain a bit more why you think this
> safe to do in the way you have done it.
>
> E.g. from a quick scroll through the patch, there's not even a comment
> explaining that the only reason there doesn't need to be code dealing
> with xid assignment because we already did the catalog changes to create
> the table.
>

Yes we do a bunch of catalog changes related to the created new table.
We will have both the txn id and command id assigned when catalogue
changes are being made. But, right after the table is created in the
leader, the command id is incremented (CommandCounterIncrement() is
called from create_ctas_internal()) whereas the txn id remains the
same. The new command id is marked as GetCurrentCommandId(true); in
intorel_startup, then the parallel mode is entered. The txn id and
command id are serialized into parallel DSM, they are then available
to all parallel workers. This is discussed in [1].

Few changes I have to make in the parallel worker code: set
currentCommandIdUsed = true;, may be via a common API
SetCurrentCommandIdUsedForWorker() proposed in [1] and remove the
extra command id sharing from the leader to workers.

I will add a few comments in the upcoming patch related to the above info.

>
> But how does that work for SELECT INTO? Are you prohibiting
> that? ...
>

In case of SELECT INTO, a new table gets created and I'm not
prohibiting the parallel inserts and I think we don't need to.
Thoughts?

>
> > Below things are still pending. Thoughts are most welcome:
> > 1. How better we can lift the "cannot insert tuples in a parallel worker"
> > from heap_prepare_insert() for only CTAS cases or for that matter parallel
> > copy? How about having a variable in any of the worker global contexts and
> > use that? Of course, we can remove this restriction entirely in case we
> > fully allow parallelism for INSERT INTO SELECT, CTAS, and COPY.
>
> And for the purpose of your question, we could then have a
> table_insert_allow_parallel(TableInsertScan *);
> or an additional arg to table_begin_insert().
>

Removing "cannot insert tuples in a parallel worker" restriction from
heap_prepare_insert() is a common problem for parallel inserts in
general, i.e. parallel inserts in CTAS, parallel INSERT INTO
SELECTs[1] and parallel copy[2]. It will be good if a common solution
is agreed.

>
> > 3. Need to restrict parallel inserts, if CTAS tries to create temp/global
> > tables as the workers will not have access to those tables. Need to analyze
> > whether to allow parallelism if CTAS has prepared statements or with no
> > data.
>
> In which case does CTAS not create a table?

AFAICS, the table gets created in all the cases but the insertion of
the data gets skipped if the user specifies "with no data" option in
which case the select part is not even planned, and so the parallelism
will also not be picked.

>
> You definitely need to
> ensure that the table is created before your workers are started, and
> there needs to be in a different CommandId.
>

Yeah, this is already being done. Table gets created in the
leader(intorel_startup which gets called from dest->rStartup(dest in
standard_ExecutorRun()) before entering the parallel mode.

[1] https://www.postgresql.org/message-id/CAJcOf-fn1nhEtaU91NvRuA3EbvbJGACMd4_c%2BUu3XU5VMv37Aw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAA4eK1%2BkpddvvLxWm4BuG_AhVvYz8mKAEa7osxp_X0d4ZEiV%3Dg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2020-09-28 10:36:04 Re: Parallel INSERT (INTO ... SELECT ...)
Previous Message Pavel Stehule 2020-09-28 09:39:34 Re: [HACKERS] [PATCH] Generic type subscripting