Re: pgsql: Add parallel-aware hash joins.

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-committers <pgsql-committers(at)postgresql(dot)org>
Subject: Re: pgsql: Add parallel-aware hash joins.
Date: 2018-01-22 10:17:47
Message-ID: CAEepm=2rZ6NzdR7Rk5jsXsJR8M_kRmi+0JSY1x6DLSdmBb-=VA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

On Thu, Dec 28, 2017 at 5:26 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
>> On Thu, Dec 28, 2017 at 3:32 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Aside from the instability problems, I'm pretty unhappy about how much
>>> the PHJ patch has added to the runtime of "make check". I do not think
>>> any one feature can justify adding 20% to that. Can't you cut down the
>>> amount of data processed by these new test cases?
>
>> Isn't that mostly because of the CV livelock problem?
>
> Hm, maybe. I quoted the 20% figure on the basis of longfin's reports,
> not prairiedog's ... but it might be seeing some of the livelock effect
> too.
>
>> So without the effects of that bug it's only taking 2.4% longer than
>> commit fa330f9a. Is that acceptable for a feature of this size and
>> complexity? I will also look into making the data sets smaller.
>
> That sounds better, but it's still worth asking whether the tests
> could be quicker.

Here is a patch that halves the size of the test tables used. I don't
want them to be too small because I want some some real parallel
processing at least sometimes. On my slowest system with assertions
enabled this brings "time make check" from ~37.5s to ~36.9s. I'm a
bit worried about choosing parameters that lead to instability across
the buildfarm, until I make the size estimation code a bit smarter[1],
so I tested a few variations that affect the alignment and size of
things (-m32, -malign-double, --disable-atomics, --disable-spinlocks)
and didn't see failures. Does this pass repeatedly on gaur?

Since you mentioned longfin, I checked some historical "check"
runtimes and they seemed pretty erratic so I decided to bust out
gnuplot to try to see what's happening. See attached, which goes back
as far as I could web-scrape from the buildfarm. Three interesting
commits are shown with vertical lines, in this order from left to
right:

Commit fa330f9a "Add some regression tests that exercise hash join
code.", Nov 29.
Commit 18042840 "Add parallel-aware hash joins.", Dec 20
Commit aced5a92 "Rewrite ConditionVariableBroadcast() to avoid
live-lock.", Jan 5.

For comparison I also plotted data for two other machines whose times
are in a similar range. It looks to me like longfin's average and
variation increased half way between fa330f9a and 18042840, somewhere
near Dec 7 to 9, when we went from ~40s +/- 1 to ~50s with several
seconds' variation. Was there some other environmental change then on
that machine? Do you see any commit that could explain that -- maybe
0a3edbb33? Then maybe it also suffered a bit from livelock woes from
18042840 until aced5a92? It's hard to say and I'm not really sure what
to make of this, but it seems clear that you could draw the wrong
conclusion if you just picked a couple of random data points from that
animal, and that those other animals aren't affected too much by any
of these commits.

[1] https://wiki.postgresql.org/wiki/Parallel_Hash

Attachment Content-Type Size
join-test-shrink.patch application/octet-stream 13.7 KB
image/png 9.6 KB

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Thomas Munro 2018-01-22 11:47:17 Re: pgsql: Add parallel-aware hash joins.
Previous Message Magnus Hagander 2018-01-22 09:18:52 pgsql: Fix docs typo

Browse pgsql-hackers by date

  From Date Subject
Next Message Maksim Milyutin 2018-01-22 11:34:58 Re: [HACKERS] PoC: custom signal handler for extensions
Previous Message Daniel Gustafsson 2018-01-22 10:07:55 Re: Handling better supported channel binding types for SSL implementations