Re: parallelize queries containing initplans

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallelize queries containing initplans
Date: 2017-10-03 11:33:57
Message-ID: CAA4eK1+vqPfTV9KpS2eGP1vqFy21=7EsQxFOMy_PoiBt5MgQ-Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Mon, Oct 2, 2017 at 8:13 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Sep 14, 2017 at 10:45 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> The latest patch again needs to be rebased. Find rebased patch
>> attached with this email.
>
> I read through this patch this morning. Copying Tom in the hopes
> that he might chime in on the following two issues in particular:
>

It seems you forgot to include Tom, included now.

> 1. Is there any superior alternative to adding ptype to ParamExecData?
> I think the reason why we don't have this already is because the plan
> tree that populates the Param must output the right type and the plan
> tree that reads the Param must expect the right type, and there's no
> need for anything else. But for serialization and deserialization
> this seems to be necessary. I wonder whether it would be better to
> try to capture this at the time the Param is generated (e.g.
> var->vartype in assign_param_for_var) rather than derived at execution
> time by applying exprType(), but I'm not sure how that would work
> exactly, or whether it's really any better.
>
> 2. Do max_parallel_hazard_walker and set_param_references() have the
> right idea about which parameters are acceptable candidates for this
> optimization? The idea seems to be to collect the setParam sets for
> all initplans between the current query level and the root. That
> looks correct to me but I'm not an expert on this parameter stuff.
>
> Some other comments:
>
> - I think parallel_hazard_walker should likely work by setting
> safe_param_ids to the right set of parameter IDs rather than testing
> whether the parameter is safe by checking either whether it is in
> safe_param_ids or some other condition is met.
>

Okay, I think it should work the way you are suggesting. I will give it a try.

> - contains_parallel_unsafe_param() sounds like a function that merely
> returns true or false, but it actually has major side effects. Those
> side effects also look unsafe; mutating previously-generated paths can
> corrupt the rel's pathlist, because it will no longer have the sort
> order and other characteristics that add_path() creates and upon which
> other code relies.
>
> - Can't is_initplan_is_below_current_query_level() be confused when
> there are multiple subqueries in the tree? For example if the
> toplevel query has subqueries a and b and a has a sub-subquery aa
> which has an initplan, won't this function think that b has an
> initplan below the current query level? If not, maybe a comment is in
> order explaining why not?
>

We can discuss above two points once you are convinced about whether
we need any such functions, so let's first discuss that.

> - Why do we even need contains_parallel_unsafe_param() and
> is_initplan_is_below_current_query_level() in the first place, anyway?
> I think there's been some discussion of that on this thread, but I'm
> not sure I completely understand it, and the comments in the patch
> don't help me understand why we now need this restriction.
>

This is to ensure that initplans are never below Gather node. If we
allow parallelism when initplan is below current query level, (a) then
we need a way to pass the result of initplan from worker to other
workers and to master backend (b) if initplan contains reference to
some parameter above the current query level (undirect correlated
plans), then we need a mechanism to pass such parameters (basically
allow correlated initplans work).

Now, one might think (a) and or (b) is desirable so that it can be
used in more number of cases, but based on TPC-H analysis we have
found that it is quite useful even without those and in fact after we
support those cases the benefits might not be significant.

The patch contains few comments in generate_gather_paths and at the
top of functions contains_parallel_unsafe_param and
is_initplan_is_below_current_query_level, if you feel it should be
explained in further detail, then let me know.

> - The new code in ExplainPrintPlan() needs a comment.
>

There was a comment, but I have added a somewhat detailed comment now,
check if that makes it clear.

> - I have typically referred in comments to "Gather or Gather Merge"
> rather than "gather or gather merge".
>

Changed as per suggestion.

Changed funcation name is_initplan_is_below_current_query_level to
is_initplan_below_current_query_level as well.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
pq_pushdown_initplan_v10.patch application/octet-stream 37.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-10-03 11:52:13 Re: [PATCH] Incremental sort
Previous Message Alvaro Herrera 2017-10-03 11:24:16 Re: [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM