Re: Parallel query execution introduces performance regressions

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Peter Geoghegan <pg(at)bowt(dot)ie>
Cc:	Jinho Jung <jinhojun(at)usc(dot)edu>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject:	Re: Parallel query execution introduces performance regressions
Date:	2019-04-01 19:00:22
Message-ID:	20190401190022.ccugidpkncnckqli@alap3.anarazel.de
Lists:	pgsql-bugs

Hi,

On 2019-04-01 11:52:54 -0700, Peter Geoghegan wrote:
> On Mon, Apr 1, 2019 at 11:30 AM Jinho Jung <jinhojun(at)usc(dot)edu> wrote:
> > Surprisingly, we found that even on a larger TPC-C database (scale factor of 50, roughly 4GB of size), parallel scan is still slower than the non-parallel execution plan in the old version.
>
> That's not a large database, and it's certainly not a large TPC-C
> database. If you attempt to stay under the spec's maximum
> tpmC/throughput per warehouse, which is 12.86 tpmC per warehouse, then
> you'll need several thousand warehouses on modern hardware. We're
> talking several hundred gigabytes. Otherwise, as far as the spec is
> concerned you're testing an unrealistic workload. There will be
> individual customers that make many more purchases than is humanly
> possible. You're modelling an app involving hypothetical warehouse
> employees that must enter data into their terminals at a rate that is
> not humanly possible.
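
To make the sizing argument above concrete, here is a rough back-of-the-envelope sketch. The 12.86 tpmC ceiling and the "scale factor 50 is roughly 4GB" figure come from the thread; the target throughput of 50,000 tpmC is purely hypothetical, and the assumption that database size grows linearly with warehouse count is a simplification.

```python
import math

# Ceiling on throughput per warehouse, as quoted from the TPC-C spec above.
MAX_TPMC_PER_WAREHOUSE = 12.86

# Rough size per warehouse, derived from "scale factor of 50, roughly 4GB".
# Assumes size scales linearly with warehouses (a simplification).
GB_PER_WAREHOUSE = 4 / 50


def min_warehouses(target_tpmc):
    """Smallest warehouse count that keeps tpmC per warehouse under the ceiling."""
    return math.ceil(target_tpmc / MAX_TPMC_PER_WAREHOUSE)


# Hypothetical throughput target; not a figure from the thread.
target_tpmc = 50_000

warehouses = min_warehouses(target_tpmc)
size_gb = warehouses * GB_PER_WAREHOUSE

print(f"{warehouses} warehouses, about {size_gb:.0f} GB")
```

Under these assumptions the result lands in the "several thousand warehouses, several hundred gigabytes" range described above.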

I don't think that's really the problem here. It's that there's a
fundamental misestimation in the query:

> [OLD version]
> Nested Loop Semi Join (cost=0.00..90020417940.08 rows=30005835 width=8) (actual time=0.034..24981.895 rows=90017507 loops=1)
>   Join Filter: (ref_0.ol_d_id <= ref_1.i_im_id)
>   -> Seq Scan on order_line ref_0 (cost=0.00..2011503.04 rows=90017504 width=12) (actual time=0.022..7145.811 rows=90017507 loops=1)
>   -> Materialize (cost=0.00..2771.00 rows=100000 width=4) (actual time=0.000..0.000 rows=1 loops=90017507)
>         -> Seq Scan on item ref_1 (cost=0.00..2271.00 rows=100000 width=4) (actual time=0.006..0.006 rows=1 loops=1)

Note the estimated rows=100000 vs the actual rows=1 in the seqscan /
materialize. That's what makes the planner think this is much more
expensive than it is, which in turn triggers the use of a parallel scan.
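
As a minimal illustration of the gap described above, the following sketch compares the estimated and actual row counts for each node, using only numbers copied from the quoted plan. The helper name `misestimate_factor` is made up for this example, and the ratio is just a simple way to size the discrepancy; it is not how PostgreSQL computes plan costs.

```python
# (node, estimated rows, actual rows) copied from the quoted EXPLAIN ANALYZE output.
nodes = [
    ("Nested Loop Semi Join", 30005835, 90017507),
    ("Seq Scan on order_line", 90017504, 90017507),
    ("Materialize", 100000, 1),
    ("Seq Scan on item", 100000, 1),
]


def misestimate_factor(estimated, actual):
    """Ratio of the larger row count to the smaller one (1.0 means spot on)."""
    # Clamp to 1 so a zero-row node does not cause a division by zero.
    low, high = sorted((max(estimated, 1), max(actual, 1)))
    return high / low


factors = {name: misestimate_factor(est, act) for name, est, act in nodes}

for name, factor in factors.items():
    print(f"{name}: off by {factor:,.1f}x")
```

The outer scan on order_line is estimated almost exactly, while both inner-side nodes are off by five orders of magnitude.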

Greetings,

Andres Freund

In response to

Re: Parallel query execution introduces performance regressions at 2019-04-01 18:52:54 from Peter Geoghegan

Responses

Re: Parallel query execution introduces performance regressions at 2019-04-01 19:05:57 from Peter Geoghegan
Re: Parallel query execution introduces performance regressions at 2019-04-01 19:08:59 from Jinho Jung
Re: Parallel query execution introduces performance regressions at 2019-04-02 03:21:00 from David Rowley
