Re: Parallel Hash take II

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Oleg Golovanov <rentech(at)mail(dot)ru>
Subject: Re: Parallel Hash take II
Date: 2017-07-31 23:55:45
Message-ID: CAEepm=3jB1NnDXKqYUi_FhoeTn5Usda4XFNphsw2V7bhvad5UQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 1, 2017 at 9:28 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2017-07-26 20:12:56 +1200, Thomas Munro wrote:
>> 2. Simplified costing. There is now just one control knob
>> "parallel_synchronization_cost", which I charge for each time the
>> participants will wait for each other at a barrier, to be set high
>> enough to dissuade the planner from using Parallel Hash for tiny hash
>> tables that would be faster in a parallel-oblivious hash join.
>> Earlier ideas about modelling the cost of shared memory access didn't
>> work out.
>
> Hm. You say, "didn't work out" - could you expand a bit on that? I'm
> quite doubtful that just accounting for barriers will be good enough.

The earlier approach and some variants I played with were based on the
idea that we should try to estimate the cost of using shared memory.
But there's no precedent for costing the cache hierarchy beyond disk
vs memory, and it depends so much on your hardware (NUMA vs UMA) and
the data distribution. I have no doubt that variations in memory
access costs are important (for example, it was data distribution that
determined whether the big cache-oblivious shared hash table or the
MonetDB-style cache-aware approach won in that paper I've mentioned
here before[1]), but it seems like a hard problem and I didn't feel
it was necessary to solve it here. Or do you have a different idea?

Another point is that in the earlier versions I was trying to teach
the planner how to choose among Hash, Shared Hash and Parallel Shared
Hash. The difference in costing between Hash and Shared Hash (one
worker builds, all workers probe) was important and sensitive, because
the only difference between them would be the cost of memory sharing.
When I dropped Shared Hash from the patch set, it no longer seemed
necessary to try to deal with such subtle costing, because Hash and
Parallel Hash (now without the word 'Shared') already have wildly
different costs: the latter is divided over N CPUs. So I felt I could
get away with a much blunter instrument: just something to avoid
parallel build overheads for tiny tables like TPC-H "nation".
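
To make that concrete, here's a rough sketch of the shape of the
comparison (not the actual costsize.c arithmetic; the constants,
barrier count and function names are invented for illustration):
dividing the build over N workers swamps any memory-sharing subtlety
for big inner relations, while a flat per-barrier charge is enough to
push tiny tables back to the parallel-oblivious plan.

#include <stdio.h>

/* Illustrative stand-ins only, not the planner's real constants. */
#define CPU_TUPLE_COST  0.01    /* cost to process one inner tuple */
#define SYNC_COST       10.0    /* hypothetical parallel_synchronization_cost */
#define NUM_BARRIERS    3       /* times participants wait for each other */

/* Parallel-oblivious hash: each worker builds its own copy of the table. */
static double
hash_build_cost(double inner_tuples)
{
    return inner_tuples * CPU_TUPLE_COST;
}

/* Parallel hash: the build is divided over the workers, but we charge
 * SYNC_COST for every barrier the participants must wait at. */
static double
parallel_hash_build_cost(double inner_tuples, int workers)
{
    return inner_tuples * CPU_TUPLE_COST / workers + NUM_BARRIERS * SYNC_COST;
}

int
main(void)
{
    /* TPC-H "nation" has 25 rows: the flat charge keeps it oblivious. */
    printf("nation (25 rows): %.2f vs %.2f\n",
           hash_build_cost(25), parallel_hash_build_cost(25, 4));

    /* A 10M-row inner relation: division by N dwarfs the barrier charge. */
    printf("big (10M rows):   %.2f vs %.2f\n",
           hash_build_cost(1e7), parallel_hash_build_cost(1e7, 4));

    return 0;
}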

I still wanted something that makes intuitive sense and that could be
set using experimental evidence, though. parallel_synchronization_cost
is an estimate of how long the average backend will have to wait for
the last backend to complete the phase and arrive at each barrier.
The most interesting case is the build phase: how long will the
last backend make us wait before probing can begin? Well, that
depends on the parallel grain. Currently, the ultimate sources of all
parallelism in our executor are Parallel Seq Scan and Parallel Index
Scan, and they hand out a page at a time. Of course, any number of
nodes may sit between the hash join and the scan, and one of them
might include a function that sleeps for 100 years in one backend or
performs a join that generates wildly different numbers of tuples in
each backend. I don't know what to do about that, other than to
assume we have perfectly spherical cows and reason on the basis of an
expected parallel grain reaching us from the scans.

One thing to note about parallel_synchronization_cost is that the cost
units, where 1 is traditionally the cost of scanning a page, actually
make *some* kind of sense here, though it's a bit tenuous: the last
worker to complete is the one that scans the final pages, while the
others find the scan already exhausted. What's really wanted here is
not simply the page-scanning cost but rather some fraction of the
total cost that represents how much extra work the lanterne rouge of
backends has to do.
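
For what it's worth, the back-of-the-envelope arithmetic I have in
mind looks like this (purely a sketch; chunk_pages and
plan_cost_per_page are invented for illustration and aren't anything
the planner currently exposes): if the scans hand out work in chunks
of G pages, the straggler is still working through its final chunk
after everyone else has run out of pages, so the average participant
waits roughly one chunk's worth of work, conveniently already in cost
units where 1 ~ one page.

#include <stdio.h>

/*
 * Hypothetical: the average participant waits for about one chunk of
 * the straggler's remaining work.  The work per page is not just
 * seq_page_cost but the whole plan's cost per scanned page, since the
 * straggler also runs everything stacked above the scan.
 */
static double
expected_barrier_wait(double chunk_pages, double plan_cost_per_page)
{
    return chunk_pages * plan_cost_per_page;
}

int
main(void)
{
    double  plan_cost_per_page = 1.5;   /* invented: scan + nodes above it */

    /* Parallel Seq Scan currently hands out one page at a time... */
    printf("1-page grain:  wait ~ %.1f cost units\n",
           expected_barrier_wait(1.0, plan_cost_per_page));

    /* ...a larger (perhaps adaptive) grain would stretch the wait. */
    printf("64-page grain: wait ~ %.1f cost units\n",
           expected_barrier_wait(64.0, plan_cost_per_page));

    return 0;
}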

Two relevant projects here are:

1. David Rowley proposes changing the seq scan grain[2], perhaps
adaptively. I suppose that as this number increases, the times at
which workers finish can diverge more.
2. The parallel-append project introduces a completely different type
of granularity based on unrelated and separately costed subplans
rather than pages. Perhaps there are things that could be done here
to model the fact that some workers might finish a long time before
others, but I don't know.

Perhaps what parallel hash really needs is not a user-controlled
parallel_synchronization_cost, but some number produced by the planner
to describe the expected distribution of tuple counts over workers.
Armed with something like that and the cost per tuple you might be
able to estimate how long we expect hash join barriers to make you
wait without introducing any new GUCs at all. I thought about some of
these things a bit but it seemed like a big research project of its
own and I was persuaded in an off-list discussion by Robert to try to
find the simplest thing that would avoid parallel-aware hash for
little tables that are already built very cheaply.
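
Just to show the shape of that idea (entirely hypothetical; no such
planner-produced statistic exists today, and "skew_fraction" is
invented for illustration): if the planner could hand us, say, the
expected fraction of tuples the unluckiest worker processes beyond its
fair share, the barrier wait could be derived from existing per-tuple
costs instead of a new GUC.

#include <stdio.h>

/*
 * Hypothetical: estimate how long the average participant waits at the
 * build barrier, given a planner-supplied measure of how unevenly the
 * inner tuples are expected to be spread across workers.
 */
static double
estimated_barrier_wait(double inner_tuples, int workers,
                       double cost_per_tuple, double skew_fraction)
{
    double  fair_share = inner_tuples / workers;

    /* Extra tuples the straggler handles beyond its fair share. */
    double  extra_tuples = fair_share * skew_fraction;

    return extra_tuples * cost_per_tuple;
}

int
main(void)
{
    /* With perfectly spherical cows (no skew) the wait disappears... */
    printf("no skew:  %.2f\n", estimated_barrier_wait(1e6, 4, 0.01, 0.0));

    /* ...while a 20%% imbalance produces a concrete, GUC-free charge. */
    printf("20%% skew: %.2f\n", estimated_barrier_wait(1e6, 4, 0.01, 0.2));

    return 0;
}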

[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.225.3495
[2] https://www.postgresql.org/message-id/CAKJS1f-XhfQ2-%3D85wgYo5b3WtEs%3Dys%3D2Rsq%3DNuvnmaV4ZsM1XQ%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
