Re: Parallel Inserts in CREATE TABLE AS

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Luc Vlaming <luc(at)swarm64(dot)com>
Cc: "Hou, Zhijie" <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Inserts in CREATE TABLE AS
Date: 2020-11-30 05:13:05
Message-ID: CALj2ACUwNSKFTq59L7vJ_4BxZL6xmzS92eH9207uGm4guVAK2w@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 27, 2020 at 1:07 PM Luc Vlaming <luc(at)swarm64(dot)com> wrote:
>
> Disclaimer: I have by no means thoroughly reviewed all the involved parts
> and am probably missing quite a bit of context so if I understood parts
> wrong or they have been discussed before then I'm sorry. Most notably
> the whole situation about the command-id is still elusive for me and I
> can really not judge yet anything related to that.
>
> IMHO the patch now makes the Gather node do most of the CTAS work,
> which seems undesirable. For the non-CTAS insert/update case, a
> ModifyTable node exists to actually do the work. I'm wondering whether
> it would be better to introduce a CreateTable node as well.
> This would have several merits:
> - the rowcount of that node would be 0 for the parallel case, and
> non-zero for the serial case. Then the gather node and the Query struct
> don't have to know about CTAS for the most part, removing e.g. the case
> distinctions in cost_gather.
> - the inserted rows can now be accounted in this new node instead of the
> parallel executor state, and this node can also do its own DSM
> initializations
> - the generation of partial variants of the CreateTable node can now
> be done in the optimizer instead of in ExecCreateTableAs, which IMHO is
> a more logical place to make these kinds of decisions. That also
> makes it potentially play nicer with costs and the like.
> - the explain code can now be in its own place instead of part of the
> gather node
> - IIUC it would allow the removal of the code to only launch parallel
> workers if it's not CTAS, which IMHO would be quite a big benefit.
>
> Thoughts?
>

If I'm not wrong, we currently have no executor nodes for DDL. I'm not
sure whether we would like to introduce one for this. Also note that
both CTAS and CREATE MATERIALIZED VIEW (CMV) are handled by the same
code, so if we add CreateTable as a new node, do we also want another
node for CMV, or a more generic node name?

The main design idea of the patch proposed in this thread is to push
the dest receiver down to the workers if the SELECT part of the CTAS
or CMV is parallelizable. Also, for CTAS or CMV we do not do any
planning as such; the planner is only influenced to take into account
that no tuples need to be transferred from the workers to the Gather
node, which may make it choose parallelism for the SELECT part. So the
planner work for CTAS or CMV is minimal. I also have the idea of
extending this design (if accepted) to REFRESH MATERIALIZED VIEW after
some analysis.
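
To make the costing point concrete, here is a rough sketch (not code from
the patch). It assumes the extra cost of a Gather node is a setup charge
plus a per-tuple transfer charge, and that the transfer charge can be
skipped when the workers insert the rows themselves. toy_gather_cost() and
its workers_insert flag are hypothetical names; the two constants mirror
the default values of the parallel_setup_cost and parallel_tuple_cost
settings.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the planner settings; values are the PostgreSQL defaults. */
static const double parallel_setup_cost = 1000.0;
static const double parallel_tuple_cost = 0.1;

/*
 * Hypothetical helper, not taken from the patch: total cost of a Gather
 * on top of a subplan. When the workers insert the rows themselves, no
 * tuples travel to the leader, so the per-tuple transfer charge is skipped.
 */
double
toy_gather_cost(double subplan_cost, double rows, bool workers_insert)
{
    double cost;

    assert(rows >= 0.0);

    cost = subplan_cost + parallel_setup_cost;
    if (!workers_insert)
        cost += parallel_tuple_cost * rows;

    return cost;
}
```

With a lower estimate for the parallel path, the planner is more likely to
pick it for the SELECT part, which is the only influence described above.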

I may be wrong above; other hackers may have better opinions.

>
> Some small things I noticed while going through the patch:
> - Typo for the comment about "inintorel_startup" which should be
> intorel_startup
>

Corrected.

>
> - if (node->nworkers_launched == 0 && !node->need_to_scan_locally)
>
> can be changed into
> if (node->nworkers_launched == 0)
> because either way it'll be true.
>

Yes, the !node->need_to_scan_locally check is not necessary; we need to
set it to true if no workers are launched. I removed the
!node->need_to_scan_locally check from the if clause.
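
As a toy illustration of the fallback (an assumption about the intended
behavior, not the patch code): if no workers could be launched, the leader
has to run the plan and do the inserts itself, so the worker count alone
decides that branch. ToyGatherState and both functions are hypothetical;
only the two field names come from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model holding only the two fields discussed above. */
typedef struct ToyGatherState
{
    int  nworkers_launched;
    bool need_to_scan_locally;
} ToyGatherState;

/*
 * Record the launch result. With zero workers the leader must scan
 * locally, whatever the leader-participation setting says.
 */
void
toy_after_launch(ToyGatherState *node, int launched, bool leader_participates)
{
    assert(launched >= 0);

    node->nworkers_launched = launched;
    node->need_to_scan_locally = (launched == 0) || leader_participates;
}

/* The simplified condition: the worker count alone is enough. */
bool
toy_leader_must_insert(const ToyGatherState *node)
{
    return node->nworkers_launched == 0;
}
```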

> On Fri, Nov 27, 2020 at 11:57 AM Hou, Zhijie <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com> wrote:
> >
> > > Thanks a lot for the use case. Yes with the current patch table will lose
> > > data related to the subplan. On analyzing further, I think we can not allow
> > > parallel inserts in the cases when the Gather node has some projections
> > > to do. Because the workers can not perform that projection. So, having
> > > ps_ProjInfo in the Gather node is an indication for us to disable parallel
> > > inserts and only the leader can do the insertions after the Gather node
> > > does the required projections.
> > >
> > > Thoughts?
> >
> > Agreed.
>
> Thanks! I will add/modify IsParallelInsertInCTASAllowed() to return
> false in this case.
>

Modified.
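
For readers without the patch at hand, the rule can be sketched roughly as
follows. This is a simplified stand-in, not the actual
IsParallelInsertInCTASAllowed() from the patch: the Toy* types, the
select_is_parallel field and toy_parallel_insert_allowed() are
hypothetical, and only the ps_ProjInfo field name is taken from the
discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder for projection state; its contents do not matter here. */
typedef struct ToyProjectionInfo
{
    int ntargets;
} ToyProjectionInfo;

/* Toy Gather node: only the pieces the check looks at. */
typedef struct ToyGatherNode
{
    bool               select_is_parallel; /* SELECT part is parallelizable */
    ToyProjectionInfo *ps_ProjInfo;        /* non-NULL: Gather must project */
} ToyGatherNode;

/*
 * Workers may insert only if the SELECT part is parallel and the Gather
 * node has no projection to do. Otherwise the leader must project the
 * tuples first and do the inserts itself.
 */
bool
toy_parallel_insert_allowed(const ToyGatherNode *gather)
{
    assert(gather != NULL);

    if (!gather->select_is_parallel)
        return false;
    if (gather->ps_ProjInfo != NULL)
        return false;

    return true;
}
```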

Attaching v6 patch that has the above review comments addressed.
Please review it further.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment: v6-0001-Parallel-Inserts-in-CREATE-TABLE-AS.patch (application/x-patch, 47.5 KB)
