Re: plan_rows confusion with parallel queries

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plan_rows confusion with parallel queries
Date: 2017-01-11 18:24:38
Message-ID: CA+TgmoaDxZ5z5Kw_oCQoymNxNoVaTCXzPaODcOuao=CzK8dMZw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 2, 2016 at 10:54 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>> I got confused by that a minute ago, so no you're not alone. The problem
>>>> is even worse in join cases. For example:
>>>> Gather  (cost=34332.00..53265.35 rows=100 width=8)
>>>>   Workers Planned: 2
>>>>   ->  Hash Join  (cost=33332.00..52255.35 rows=100 width=8)
>>>>         Hash Cond: ((pp.f1 = cc.f1) AND (pp.f2 = cc.f2))
>>>>         ->  Append  (cost=0.00..8614.96 rows=417996 width=8)
>>>>               ->  Parallel Seq Scan on pp  (cost=0.00..8591.67 rows=416667 width=8)
>>>>               ->  Parallel Seq Scan on pp1  (cost=0.00..23.29 rows=1329 width=8)
>>>>         ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
>>>>               ->  Seq Scan on cc  (cost=0.00..14425.00 rows=1000000 width=8)
>> Although - it is estimating 1M rows, but only "per worker" estimates are
>> shown, and because there are 2 workers planned it says 1M/2.4 which is
>> the 416k. I agree it's a bit unclear, but at least it's consistent with
>> how we treat loops (i.e. that the numbers are per loop).
>
> Well, it's not *that* consistent. If we were estimating all the numbers
> underneath the Gather as being per-worker numbers, that would make some
> amount of sense. But neither the other seqscan, nor the hash on it, nor
> the hashjoin's output count are scaled that way. It's very hard to call
> the above display anything but flat-out broken.

While investigating why Rushabh Lathia's Gather Merge patch sometimes
fails to pick a Gather Merge plan even when it really ought to do so,
I ran smack into this problem. I discovered that this is more than a
cosmetic issue. The costing itself is actually badly broken. In the
single-table case, when you have just ...

Gather
  ->  Parallel Seq Scan

...the Parallel Seq Scan node reflects a per-worker row estimate, and
the Gather node reflects a total row estimate. But in the join case,
as shown above, the Gather thinks that the total number of rows which
it will produce is equal to the number that will be produced by one
single worker, which is crap, and the cost of doing the join in
parallel is based on the per-worker rather than the total number,
which is crappier. The difference in cost between the Gather and the
underlying join in the above example is exactly 1010, namely 1000 for
parallel_setup_cost plus 100 tuples at the default parallel_tuple_cost of
0.1 apiece, even though 100 is the number of tuples per worker, not the
total number. That's
really not good. I probably should have realized this when I looked
at this thread the first time, but I somehow got it into my head that
this was just a complaint about the imperfections of the display
(which is indeed imperfect) and failed to realize that the same report
was also pointing to an actual costing bug. I apologize for that.
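To spell out that arithmetic, here is a throwaway C check (the 1000 and 0.1
are just the default parallel_setup_cost and parallel_tuple_cost settings;
this is not planner code, only the numbers from the plan above):

#include <stdio.h>

int main(void)
{
    double parallel_setup_cost = 1000.0;  /* default value of the GUC */
    double parallel_tuple_cost = 0.1;     /* default value of the GUC */
    double rows_charged = 100.0;          /* the per-worker estimate that
                                             gets charged to the Gather */

    /* Gather overhead = one-time setup cost + per-tuple transfer cost */
    double gather_overhead = parallel_setup_cost
                             + parallel_tuple_cost * rows_charged;

    printf("overhead charged to Gather: %.2f\n", gather_overhead);      /* 1010.00 */
    printf("gap in the plan above:      %.2f\n", 53265.35 - 52255.35);  /* 1010.00 */
    return 0;
}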

The reason why this is happening is that final_cost_nestloop(),
final_cost_hashjoin(), and final_cost_mergejoin() don't care a whit
about whether the path they are generating is partial. They apply the
row estimate for the joinrel itself to every such path generated for
the join, except for parameterized paths, which are a special case. I
think this generally has the effect of discouraging parallel joins,
because the inflated row count also inflates the join cost. I think
the right thing to do is probably to scale the row count estimate for
the joinrel's partial paths by the leader_contribution value computed
in cost_seqscan.
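
Concretely, I am thinking of something shaped like the sketch below, which
just reuses the divisor cost_seqscan derives (2 workers => 2.4, matching the
1M/2.4 observation upthread). The function name and placement are invented
for illustration; this is not a patch:

static double
partial_join_rows_sketch(double joinrel_rows, int parallel_workers)
{
	double		parallel_divisor = parallel_workers;
	double		leader_contribution;

	/*
	 * Same heuristic cost_seqscan uses: the leader also contributes work,
	 * but less and less as the number of workers grows.
	 */
	leader_contribution = 1.0 - (0.3 * parallel_workers);
	if (leader_contribution > 0)
		parallel_divisor += leader_contribution;

	/* Per-worker row estimate to use for the joinrel's partial paths. */
	return joinrel_rows / parallel_divisor;
}

final_cost_nestloop() and friends would then charge the join for that
per-worker count rather than the joinrel's total, and the Gather above it
could scale back up to the total, as already happens in the single-table
case.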

Despite my general hatred of back-patching things that cause plan
changes, I'm inclined to think the fix for this should be back-patched
to 9.6, because this is really a brown-paper-bag bug. If the
consensus is otherwise I will of course defer to that consensus.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
