Re: Parallel INSERT (INTO ... SELECT ...)

From: Greg Nancarrow <gregn4422(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel INSERT (INTO ... SELECT ...)
Date: 2020-11-02 06:39:00
Message-ID: CAJcOf-fhdbhP7PeKf-sgRuZG4DLPyrbQUbub=ho=A9dyNhhrnw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Oct 30, 2020 at 5:00 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> > So then to avoid
> > errors from when parallel-mode is forced on and such unsafe INSERTs
> > are run, the only real choice is to only allow parallelModeNeeded to
> > be true for SELECT only (not INSERT), and this is kind of cheating and
> > also not picking up cases where parallel-safe INSERT is run but
> > invokes parallel-mode-unsafe features.
> > My conclusion, at least for the moment, is to leave the check where it is.
> >
>
> Okay, then can we integrate the functionality of
> MaxParallelHazardForModify in max_parallel_hazard? Calling it
> separately looks bit awkward.
>

Looking into that.

> >
> > > Few other comments on latest patch:
> > > ===============================
> > > 1.
> > > MaxRelParallelHazardForModify()
> > > {
> > > ..
> > > + if (commandType == CMD_INSERT || commandType == CMD_UPDATE)
> > > + {
> > > + /*
> > > ..
> > >
> > > Why to check CMD_UPDATE here?
> > >
> >
> > That was a bit of forward-thinking, for when/if UPDATE/DELETE is
> > supported in parallel-mode.
> > Column default expressions and check-constraints are only applicable
> > to INSERT and UPDATE.
> > Note however that currently this function can only ever be called with
> > commandType == CMD_INSERT.
> >
>
> I feel then for other command types there should be an Assert rather
> than try to handle something which is not yet implemented nor it is
> clear what all is required for that. It confuses the reader, at least
> it confused me. Probably we can write a comment but I don't think we
> should have any check for Update at this stage of work.
>

OK, for now I'll restrict the checks to INSERT, but I'll add comments
to assist with potential future UPDATE support.

> > > 2.
> > > +void PrepareParallelModeForModify(CmdType commandType, bool
> > > isParallelModifyLeader)
> > > +{
> > > + Assert(!IsInParallelMode());
> > > +
> > > + if (isParallelModifyLeader)
> > > + (void)GetCurrentCommandId(true);
> > > +
> > > + (void)GetCurrentFullTransactionId();
> > >
> > > Here, we should use GetCurrentTransactionId() similar to heap_insert
> > > or other heap operations. I am not sure why you have used
> > > GetCurrentFullTransactionId?
> > >
> >
> > GetCurrentTransactionId() and GetCurrentFullTransactionId() actually
> > have the same functionality, just a different return value (which is
> > not being used here).
> >
>
> Sure but lets use what is required.
>
> > But anyway I've changed it to use GetCurrentTransactionId().
> >
>
> But comments in ExecutePlan and PrepareParallelModeForModify still
> refer to FullTransactionId.
>

I believe those comments are technically correct.
GetCurrentTransactionId() calls AssignTransactionId() to do all the
work - and the comment for that function says "Assigns a new permanent
FullTransactionId to the given TransactionState".

> >
> >
> > > 4. Have you checked the overhead of this on the planner for different
> > > kinds of statements like inserts into tables having 100 or 500
> > > partitions? Similarly, it is good to check the overhead of domain
> > > related checks added in the patch.
> > >
> >
> > Checking that now and will post results soon.
> >
>

I am seeing a fair bit of overhead in the planning for the INSERT
parallel-safety checks (mind you, compared to the overall performance
gain, it's not too bad).
Some representative timings for a parallel INSERT of a millions rows
into 100,250 and 500 partitions are shown below.

(1) Without patch

# Partitions Planning Time (ms)
Execution Time (ms)
100 1.014
4176.435
250 0.404
3842.414
500 0.529
4440.633

(2) With Parallel INSERT patch

# Partitions Planning Time (ms)
Execution Time (ms)
100 11.420
2131.148
250 23.269
3472.259
500 36.531
3238.868

I'm looking into how this can be improved by better integration into
the current code, and addressing locking concerns that you've
previously mentioned.

Regards,
Greg Nancarrow
Fujitsu Australia

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Luc Vlaming 2020-11-02 06:53:45 Re: should INSERT SELECT use a BulkInsertState?
Previous Message Michael Paquier 2020-11-02 06:22:03 Re: bulk typos