Re: Parallel Inserts in CREATE TABLE AS

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Hou, Zhijie" <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com>, Luc Vlaming <luc(at)swarm64(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Zhihong Yu <zyu(at)yugabyte(dot)com>
Subject: Re: Parallel Inserts in CREATE TABLE AS
Date: 2020-12-07 10:13:59
Message-ID: CALj2ACWTBKi3+-tnfeHp+cJqwMFuxkwekLkt1n1VLfT4Dqw9zQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Dec 7, 2020 at 2:55 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Dec 7, 2020 at 11:32 AM Hou, Zhijie <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com> wrote:
> >
> > Hi
> >
> > + /*
> > + * Flag to let the planner know that the SELECT query is for CTAS. This is
> > + * used to calculate the tuple transfer cost from workers to gather node(in
> > + * case parallelism kicks in for the SELECT part of the CTAS), to zero as
> > + * each worker will insert its share of tuples in parallel.
> > + */
> > + if (IsParallelInsertInCTASAllowed(into, NULL))
> > + query->isForCTAS = true;
> >
> >
> > + /*
> > + * We do not compute the parallel_tuple_cost for CTAS because the number of
> > + * tuples that are transferred from workers to the gather node is zero as
> > + * each worker, in parallel, inserts the tuples that are resulted from its
> > + * chunk of plan execution. This change may make the parallel plan cheap
> > + * among all other plans, and influence the planner to consider this
> > + * parallel plan.
> > + */
> > + if (!(root->parse->isForCTAS &&
> > + root->query_level == 1))
> > + run_cost += parallel_tuple_cost * path->path.rows;
> >
> > I noticed that the parallel_tuple_cost will still be ignored,
> > When Gather is not the top node.
> >
> > Example:
> > Create table test(i int);
> > insert into test values(generate_series(1,10000000,1));
> > explain create table ntest3 as select * from test where i < 200 limit 10000;
> >
> > QUERY PLAN
> > -------------------------------------------------------------------------------
> > Limit (cost=1000.00..97331.33 rows=1000 width=4)
> > -> Gather (cost=1000.00..97331.33 rows=1000 width=4)
> > Workers Planned: 2
> > -> Parallel Seq Scan on test (cost=0.00..96331.33 rows=417 width=4)
> > Filter: (i < 200)
> >
> >
> > The isForCTAS will be true because [create table as], the
> > query_level is always 1 because there is no subquery.
> > So even if gather is not the top node, parallel cost will still be ignored.
> >
> > Is that works as expected ?
> >
>
> I don't think that is expected and is not the case without this patch.
> The cost shouldn't be changed for existing cases where the write is
> not pushed to workers.
>

Thanks for pointing that out. Yes it should not change for the cases
where parallel inserts will not be picked later.

Any better suggestions on how to make the planner consider that the
CTAS might choose parallel inserts later at the same time avoiding the
above issue in case it doesn't?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2020-12-07 10:28:53 Re: Add primary keys to system catalogs
Previous Message Amit Kapila 2020-12-07 10:12:20 Re: RFC: Deadlock detector hooks for victim selection and edge injection