Re: parallelize queries containing subplans

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallelize queries containing subplans
Date: 2017-02-14 23:08:20
Message-ID: CA+TgmoZYuT8i0LinomZv-o5V4ZRuPQ_V0ZCWabw9epK-=EwRTw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 14, 2017 at 4:24 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On further evaluation, it seems this patch has one big problem which
> is that it will allow forming parallel plans which can't be supported
> with current infrastructure. For example, marking immediate-level params
> as parallel safe can generate the following type of plan:
>
> Seq Scan on t1
>    Filter: (SubPlan 1)
>    SubPlan 1
>      ->  Gather
>            Workers Planned: 1
>            ->  Result
>                  One-Time Filter: (t1.k = 0)
>                  ->  Parallel Seq Scan on t2
>
>
> In this plan, we can't evaluate the one-time filter (which contains a
> correlated param) unless we have the capability to pass all kinds of
> PARAM_EXEC params to workers. I don't want to invest too much time in
> this patch unless somebody can see some way using current parallel
> infrastructure to implement correlated subplans.

I don't think this approach has much chance of working; it just seems
too simplistic. I'm not entirely sure what the right approach is.
Unfortunately, the current query planner code seems to compute the
sets of parameters that are set and used quite late, and really only
on a per-subquery level. Here we need to know whether there is
anything that's set below the Gather node and used above it, or the
other way around, and we need to know it much earlier, while we're
still doing path generation. There doesn't seem to be any simple way
of getting that information, but I think you need it.
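
To make the question concrete, here is a rough sketch in Python of the
information being asked for. It is a toy model, not planner code: the
PlanNode class, its sets_params/uses_params fields, and the use of param
ID 0 for t1.k are all assumptions made up for illustration. It walks a
plan tree and reports which params are set on one side of the Gather and
used on the other.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    """Toy plan node (hypothetical; not a PostgreSQL data structure)."""
    name: str
    sets_params: set = field(default_factory=set)  # param IDs this node assigns
    uses_params: set = field(default_factory=set)  # param IDs this node reads
    children: list = field(default_factory=list)


def collect(node, stop=None):
    """Return (params set, params used) in the subtree at node.

    The subtree rooted at `stop`, if given, is skipped entirely.
    """
    if node is stop:
        return set(), set()
    set_ids, used_ids = set(node.sets_params), set(node.uses_params)
    for child in node.children:
        child_set, child_used = collect(child, stop)
        set_ids |= child_set
        used_ids |= child_used
    return set_ids, used_ids


def params_crossing_gather(root, gather):
    """Find params whose setter and user are on opposite sides of the Gather."""
    below_set, below_used = collect(gather)
    above_set, above_used = collect(root, stop=gather)
    passed_down = above_set & below_used  # set above the Gather, used below it
    passed_up = below_set & above_used    # set below the Gather, used above it
    return passed_down, passed_up


# The plan quoted above, with t1.k modeled as param 0 (an assumed ID).
t2_scan = PlanNode("Parallel Seq Scan on t2")
result = PlanNode("Result", uses_params={0}, children=[t2_scan])
gather = PlanNode("Gather", children=[result])
t1_scan = PlanNode("Seq Scan on t1", sets_params={0}, children=[gather])

passed_down, passed_up = params_crossing_gather(t1_scan, gather)
print(passed_down, passed_up)
```

For the quoted plan this reports param 0 as set above the Gather and used
below it. The hard part in the real planner is having these sets available
during path generation, which this sketch does not address.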

What's more, I think you would still need it even if you had the
ability to pass parameter values between processes. For example,
consider:

Gather
  -> Parallel Seq Scan
       Filter: (Correlated Subplan Reference Goes Here)

Of course, the Param in the filter condition *can't* be a shared Param
across all processes. It needs to be private to each process
participating in the parallel sequential scan -- and the params
passing data down from the Parallel Seq Scan to the correlated subplan
also need to be private. On the other hand, in your example quoted
above, you do need to share across processes: the value for t1.k needs
to get passed down. So it seems to me that we somehow need to
identify, for each parameter that gets used, whether it's provided by
something beneath the Gather node (in which case it should be private
to the worker) or whether it's provided from higher up (in which case
it should be passed down to the worker, or if we can't do that, then
don't use parallelism there).
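
The per-parameter decision described above could be sketched roughly as
follows, again in Python and again as a toy model rather than planner
code. The function names, the decision labels, and the param IDs are
hypothetical; can_pass_to_workers stands in for whether the executor is
able to ship parameter values to workers.

```python
PRIVATE = "private to each process"
PASS_DOWN = "pass down to workers"
UNSAFE = "don't use parallelism here"


def classify_params(used_below, set_below, can_pass_to_workers):
    """Decide how each param used below a Gather would have to be handled.

    used_below: param IDs read somewhere beneath the Gather
    set_below:  param IDs assigned somewhere beneath the Gather
    """
    decisions = {}
    for param in sorted(used_below):
        if param in set_below:
            # Provided by something beneath the Gather.
            decisions[param] = PRIVATE
        elif can_pass_to_workers:
            # Provided from higher up, and we are able to ship the value.
            decisions[param] = PASS_DOWN
        else:
            # Provided from higher up, but there is no way to ship it.
            decisions[param] = UNSAFE
    return decisions


def gather_is_usable(decisions):
    """A Gather is only usable if no param forces us to give up."""
    return UNSAFE not in decisions.values()


# Correlated subplan under a Parallel Seq Scan: param 1 (assumed ID) is
# both set and used below the Gather.
scan_case = classify_params({1}, {1}, can_pass_to_workers=False)

# The quoted plan: param 0 (t1.k, assumed ID) is set above the Gather.
quoted_case = classify_params({0}, set(), can_pass_to_workers=False)

print(scan_case, gather_is_usable(scan_case))
print(quoted_case, gather_is_usable(quoted_case))
```

Under these assumptions the first case stays parallel with a private
param, while the quoted plan is rejected unless values can be passed to
workers.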

(There are also possibly a couple of other cases, like an initPlan that
needs to get executed only once, and also maybe a case where a
parameter is set below the Gather and later used above the Gather.
Not sure if that latter one can happen, or how to deal with it.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
