Re: WIP: [[Parallel] Shared] Hash

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: [[Parallel] Shared] Hash
Date: 2017-03-18 01:30:23
Message-ID: CAEepm=2fE0UBOXzaBvvW4HsQZDQG4MpHBFai_T0iou0oA_VBPw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 14, 2017 at 8:03 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Mon, Mar 13, 2017 at 8:40 PM, Rafia Sabih
> <rafia(dot)sabih(at)enterprisedb(dot)com> wrote:
>> In an attempt to test v7 of this patch on TPC-H 20 scale factor I found a
>> few regressions,
>> Q21: 52 secs on HEAD and 400 secs with this patch
>
> Thanks Rafia. Robert just pointed out off-list that there is a bogus
> 0 row estimate in here:
>
> -> Parallel Hash Semi Join (cost=1006599.34..1719227.30 rows=0
> width=24) (actual time=38716.488..100933.250 rows=7315896 loops=5)
>
> Will investigate, thanks.

There are two problems here.

1. There is a pre-existing cardinality estimate problem for
semi-joins with <> filters. The big Q21 regression reported by Rafia
is caused by that phenomenon, probably exacerbated by another bug that
allowed 0 cardinality estimates to percolate inside the planner.
Estimates have been clamped at or above 1.0 since her report by commit
1ea60ad6.

I started a new thread to discuss that because it's unrelated to this
patch, except insofar as it confuses the planner about Q21 (with or
without parallelism). Using one possible selectivity tweak suggested
by Tom Lane, I was able to measure significant speedups on otherwise
unpatched master:

https://www.postgresql.org/message-id/CAEepm%3D11BiYUkgXZNzMtYhXh4S3a9DwUP8O%2BF2_ZPeGzzJFPbw%40mail.gmail.com

2. If you compare master tweaked as above against the latest version
of my patch series with the tweak, then the patched version always
runs faster with 4 or more workers, but with only 1 or 2 workers Q21
is a bit slower... but not always. I realised that there was a
bi-modal distribution of execution times. It looks like my 'early
exit' protocol, designed to make tuple-queue deadlock impossible, is
often causing us to lose a worker. I am working on that now.

I have code changes for Peter G's and Andres's feedback queued up and
will send a v8 series shortly, hopefully with a fix for problem 2
above.

--
Thomas Munro
http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Erik Rijkers 2017-03-18 05:37:36 more on comments of snapbuild.c
Previous Message Robert Haas 2017-03-18 00:10:02 Re: Partition-wise join for join between (declaratively) partitioned tables