Re: Parameterization of partial path

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Antonin Houska <ah(at)cybertec(dot)at>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parameterization of partial path
Date: 2017-02-09 18:06:47
Message-ID: CA+TgmobTjc0KxQoq5HCs4D+mtr0CKOeq2h0jkEB0zzFda_a8MA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 9, 2017 at 12:36 PM, Antonin Houska <ah(at)cybertec(dot)at> wrote:
> When looking at try_partial_hashjoin_path and try_partial_nestloop_path
> functions I'm failing to understand the comment "Parameterized partial paths
> are not supported.".
>
> It's clear to me that GatherPath can't have parameters because repeated
> execution of parallel plan with adjusted parameters is not supported.

Actually, I think in theory that is fine. You'd start up and shut
down workers for every execution, which is likely to stink in terms of
performance, but it should work OK. The only problem is that it'll
only work if you pass the values of the parameters down to the worker
processes, which the code currently doesn't. Bind parameters sent by
the user are passed down, but there's no provision to pass down
parameters populated at execution time. Amit has written code for
that, though, and it's been posted here. It just hasn't gotten into
the tree yet for one reason or another.

> But the
> fact that generic partial path has parameters should not be a problem if those
> parameters are satisfied below the nearest GatherPath node higher in the plan
> tree. Do I miss anything of the concept?

Yes, I think you're missing a key point. A parameterized path would
normally be used for a query like small LJ (big1 IJ big2 ON big1.x =
big2.x) ON big1.y = small.y. The plan might look like this:

Nested Loop Left Join
-> Seq Scan on small
-> Nested Loop Join
-> Index Scan on big1
-> Index Scan on big2

In essence, we're repeating the join between big1 and big2 for every
row in small. That seems like a bad strategy until you realize that
each join will scan only a tiny fraction of each of those tables. If
you couldn't pass a parameter down from the scan of small to the scans
on big1 and big2, you'd end up with a plan that scans one or both of
those tables in their entirety. Ouch.

Now, this plan can be parallelized just fine. The sequential scan on
small can be replaced with a Parallel Seq Scan. That works fine, and
the planner is already capable of generating such plans. However, at
no point in that do you get a parameterized *partial* path. You
generate regular old parameterized paths for big1 and big2 and join
then to produce a parameterized path for (big1 big2), and then you
join that via a nested loop with the non-parameterized partial path
for small, and you get a partial but unparameterized path for (small
big1 big2). Then you push a Gather node on top and you're done.

Suppose we did generate a partial plan for (big1 big2). It would look
like this:

Nested Loop Join
-> Parallel Index Scan on big1
-> Index Scan on big2

Now, how could you join that in a meaningful way to small so as to
come up with anything sensible? You really can't. Consider a plan
like this:

Gather
-> Nested Loop Left Join
-> Seq Scan on small
-> Nested Loop Join
-> Partial Index Scan on big1
-> Index Scan on big2

That's clearly nonsense. The partial index scan is supposed to return
a subset of the result rows to each worker, but there's no reason the
workers have to be basing their work on the same row from 'small'.
Which of their values would the parallel index scan supposedly be
using? This isn't a matter of some implementation detail that we're
currently missing; such a plan is just inherently nonsensical.

> Actually this check
>
> if (!bms_is_subset(inner_paramrels, outer_path->parent->relids))
> return;
>
> in try_partial_nestloop_path only rejects special case where the inner path
> references relations outside the join, but still allows the outer path to have
> parameters outside.

Right. We build a partial join path by joining a partial path on the
outer side with a non-partial path on the inner side. If the join is
a nested loop, the inner side can be parameterized, but all of those
parameters have to be provided by the outer side; if not, we get the
nonsensical situation illustrated above.

> As for try_partial_hashjoin_path, the comment "If the inner path is
> parameterized ..." seems to be copy & pasted from try_partial_nestloop_path,
> but I think it does not fit hash join. From hash_inner_and_outer I judge that
> neither side of hash join can be parameterized by the other:
>
> /*
> * If either cheapest-total path is parameterized by the other rel, we
> * can't use a hashjoin. (There's no use looking for alternative
> * input paths, since these should already be the least-parameterized
> * available paths.)
> */
> if (PATH_PARAM_BY_REL(cheapest_total_outer, innerrel) ||
> PATH_PARAM_BY_REL(cheapest_total_inner, outerrel))
> return;
>
> Shouldn't try_partial_hashjoin_path and try_partial_nestloop_path do just the
> same checks that try_hashjoin_path and try_nestloop_path do respectively?

No, because there is no point in producing parameterized partial paths
for the reasons mentioned above. I think you're right that the
comment in try_partial_hashjoin_path is not quite right. What it
should really be saying is that a hash join between a partial path and
a parameterized path is bound to produce a parameterized partial path,
and those aren't useful for anything, so we shouldn't create them.

Maybe someday somebody will figure out a context in which a partial
parameterized path is actually good for something, but until then we
shouldn't generate them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Sharma 2017-02-09 18:11:15 Re: pageinspect: Hash index support
Previous Message Peter Geoghegan 2017-02-09 18:01:06 Re: ICU integration