Re: Parallel Inserts in CREATE TABLE AS

From: Luc Vlaming <luc(at)swarm64(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: "Hou, Zhijie" <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com>, Zhihong Yu <zyu(at)yugabyte(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: Re: Parallel Inserts in CREATE TABLE AS
Date: 2021-01-05 07:54:56
Message-ID: 82691227-2dc7-7cb1-7e76-5911a32512e4@swarm64.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 04-01-2021 14:53, Bharath Rupireddy wrote:
> On Mon, Jan 4, 2021 at 5:44 PM Luc Vlaming <luc(at)swarm64(dot)com> wrote:
>> On 04-01-2021 12:16, Hou, Zhijie wrote:
>>>> ================
>>>> wrt v18-0002....patch:
>>>>
>>>> It looks like this introduces a state machine that goes like:
>>>> - starts at CTAS_PARALLEL_INS_UNDEF
>>>> - possibly moves to CTAS_PARALLEL_INS_SELECT
>>>> - CTAS_PARALLEL_INS_TUP_COST_CAN_IGN can be added
>>>> - if both were added at some stage, we can go to
>>>> CTAS_PARALLEL_INS_TUP_COST_IGNORED and ignore the costs
>>>>
>>>> what i'm wondering is why you opted to put logic around
>>>> generate_useful_gather_paths and in cost_gather when to me it seems more
>>>> logical to put it in create_gather_path? i'm probably missing something
>>>> there?
>>>
>>> IMO, The reason is we want to make sure we only ignore the cost when Gather is the top node.
>>> And it seems the generate_useful_gather_paths called in apply_scanjoin_target_to_paths is the right place which can only create top node Gather.
>>> So we change the flag in apply_scanjoin_target_to_paths around generate_useful_gather_paths to identify the top node.
>
> Right. We wanted to ignore parallel tuple cost for only the upper Gather path.
>
>> I was wondering actually if we need the state machine. Reason is that as
>> AFAICS the code could be placed in create_gather_path, where you can
>> also check if it is a top gather node, whether the dest receiver is the
>> right type, etc? To me that seems like a nicer solution as its makes
>> that all logic that decides whether or not a parallel CTAS is valid is
>> in a single place instead of distributed over various places.
>
> IMO, we can't determine the fact that we are going to generate the top
> Gather path in create_gather_path. To decide on whether or not the top
> Gather path generation, I think it's not only required to check the
> root->query_level == 1 but we also need to rely on from where
> generate_useful_gather_paths gets called. For instance, for
> query_level 1, generate_useful_gather_paths gets called from 2 places
> in apply_scanjoin_target_to_paths. Likewise, create_gather_path also
> gets called from many places. IMO, the current way i.e. setting flag
> it in apply_scanjoin_target_to_paths and ignoring based on that in
> cost_gather seems safe.
>
> I may be wrong. Thoughts?
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
>

So the way I understand it the requirements are:
- it needs to be the top-most gather
- it should not do anything with the rows after the gather node as this
would make the parallel inserts conceptually invalid.

Right now we're trying to judge what might be added on-top that could
change the rows by inspecting all parts of the root object that would
cause anything to be added, and add a little statemachine to track the
state of that knowledge. To me this has the downside that the list in
HAS_PARENT_PATH_GENERATING_CLAUSE has to be exhaustive, and we need to
make sure it stays up-to-date, which could result in regressions if not
tracked carefully.

Personally I would therefore go for a design which is safe in the sense
that regressions are not as easily introduced. IMHO that could be done
by inspecting the planned query afterwards, and then judging whether or
not the parallel inserts are actually the right thing to do.

Another way to create more safety against regressions would be to add an
assert upon execution of the query that if we do parallel inserts that
only a subset of allowed nodes exists above the gather node.

Some (not extremely fact checked) approaches as food for thought:
1. Plan the query as normal, and then afterwards look at the resulting
plan to see if there are only nodes that are ok between the gather node
and the top node, which afaics would only be things like append nodes.
Which would mean two things:
- at the end of subquery_planner before the final_rel is fetched, we add
another pass like the grouping_planner called e.g.
parallel_modify_planner or so, which traverses the query plan and checks
if the inserts would indeed be executed parallel, and if so sets the
cost of the gather to 0.
- we always keep around the best gathered partial path, or the partial
path itself.

2. Generate both gather paths: one with zero cost for the inserts and
one with costs. the one with zero costs would however be kept separately
and added as prime candidate for the final rel. then we can check in the
subquery_planner if the final candidate is different and then choose.

Kind regards,
Luc

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2021-01-05 07:56:14 Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
Previous Message Thomas Munro 2021-01-05 07:54:11 Re: Moving other hex functions to /common