Re: WIP: [[Parallel] Shared] Hash

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: [[Parallel] Shared] Hash
Date: 2017-01-06 20:01:01
Lists: pgsql-hackers

On Tue, Jan 3, 2017 at 10:53 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> I will post a new rebased version soon with that and
> some other nearby problems fixed.

Here is a new WIP patch. I have plenty of things to tidy up (see note
at end), but the main ideas are now pretty clear and I'd appreciate
some feedback. The main changes since the last patch, other than
debugging, are:

* the number of batches now increases if work_mem would be exceeded;
the work of 'shrinking' the hash table in memory in that case is done
in parallel

* work_mem accounting is done at the chunk level, instead of per
tuple (see the sketch just after this list)

* interlocking has been rethought
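
To illustrate the chunk-level accounting, here is a tiny self-contained
sketch (all names, numbers and the trigger logic below are mine, purely
illustrative, not the patch's): memory is charged against the shared
budget a whole chunk at a time, when the chunk is allocated, rather
than tuple by tuple, and blowing the budget is what triggers doubling
the number of batches.

/* Illustrative sketch of chunk-level work_mem accounting. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define HASH_CHUNK_SIZE (32 * 1024)   /* memory is grabbed in blocks this size */

typedef struct SharedState
{
    size_t chunk_bytes;     /* bytes charged so far, whole chunks only */
    size_t space_allowed;   /* the work_mem budget shared by all participants */
    int    nbatch;          /* current number of batches */
} SharedState;

/*
 * Charge one new chunk against the shared budget.  In the patch this
 * happens under chunk_lock; here it is a plain function.  Returns true
 * if the budget was exceeded and the table had to be shrunk into more
 * batches.
 */
static bool
charge_chunk(SharedState *s)
{
    s->chunk_bytes += HASH_CHUNK_SIZE;
    if (s->chunk_bytes > s->space_allowed)
    {
        s->nbatch *= 2;                     /* more batches, fewer tuples in memory */
        s->chunk_bytes = HASH_CHUNK_SIZE;   /* stand-in for the parallel shrink,
                                             * which frees chunks of tuples that
                                             * now belong to future batches */
        return true;
    }
    return false;
}

int
main(void)
{
    SharedState s = {0, 64 * 1024, 1};      /* pretend work_mem is 64kB */

    for (int i = 0; i < 4; i++)
        if (charge_chunk(&s))
            printf("chunk %d: over budget, nbatch is now %d\n", i, s.nbatch);
    return 0;
}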

Previously, I had some ideas about using lock-free tricks for managing
chunks of memory, but you may be relieved to hear that I abandoned
those plans. Now, atomic ops are used for only one thing: pushing
tuples into the shared hash table buckets. An LWLock called
'chunk_lock' protects various linked lists of chunks of memory, and
also the shared work_mem accounting. The idea is that backends can
work independently on HASH_CHUNK_SIZE blocks of tuples at a time,
needing to acquire that lock only briefly in between chunks. Also,
there is now a second barrier, used to coordinate hash table
shrinking. This can happen any number of times during the
PHJ_PHASE_HASHING and PHJ_PHASE_LOADING_BATCH(n) phases as required to
stay under work_mem, so it needed to be a separate barrier.
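
To show the shape of that one atomic operation, here is a minimal
self-contained sketch of the compare-and-swap push onto a bucket (it
uses C11 atomics and raw pointers purely for illustration; the patch
itself works with PostgreSQL's atomics on pointers into shared memory,
and the names below are mine, not the patch's):

/* Illustrative sketch of pushing a tuple onto a shared bucket with CAS. */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Tuple
{
    struct Tuple *next;     /* next tuple in the same bucket */
    int           key;
} Tuple;

static _Atomic(Tuple *) bucket;     /* head of one shared bucket's chain */

static void
push_tuple(Tuple *tuple)
{
    Tuple *head = atomic_load(&bucket);

    do
    {
        tuple->next = head;     /* link behind the head we last saw */
    } while (!atomic_compare_exchange_weak(&bucket, &head, tuple));
}

int
main(void)
{
    for (int i = 0; i < 3; i++)
    {
        Tuple *t = malloc(sizeof(Tuple));

        t->key = i;
        push_tuple(t);
    }
    for (Tuple *t = atomic_load(&bucket); t != NULL; t = t->next)
        printf("key %d\n", t->key);     /* newest first: 2, 1, 0 */
    return 0;
}

Any number of backends can run push_tuple against the same bucket
without holding a lock; one that loses the race simply re-links
against the new head and retries.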

The communication in this patch is a bit more complicated than in
other nearby parallel query projects I've looked at; probably the worst bit
is the leader deadlock avoidance stuff (see
ExecHashCheckForEarlyExit), and the second worst bit is probably the
switch statements for allowing participants to show up late and get in
sync, which makes that other problem even more annoying; without those
problems and with just the right kind of reusable shared tuplestore,
this would be a vastly simpler patch. Those are not really
fundamental problems of parallel joins using a shared hash table, but
they're problems I don't have a better solution to right now.

Stepping back a bit, I am aware of the following approaches to hash
join parallelism:

1. Run the inner plan and build a private hash table in each
participant, and then scatter the outer plan arbitrarily across
participants. This is what 9.6 does, and it's a good plan for small
hash tables with fast inner plans, but a terrible plan for expensive
or large inner plans. Communication overhead: zero; CPU overhead:
runs the inner plan in k workers simultaneously; memory overhead:
builds k copies of the hashtable; disk overhead: may need to spill k
copies of all batches to disk if work_mem exceeded; restrictions:
can't do right/full joins because there are no shared 'matched' flags.

2. Run a partition-wise hash join[1]. Communication overhead: zero;
CPU overhead: zero; memory overhead: zero; disk overhead: zero;
restrictions: the schema must include compatible partition keys, and
potential parallelism is limited by the number of partitions.

3. Repartition the data on the fly, and then run a partition-wise
hash join. Communication overhead: every tuple on at least one and
possibly both sides must be rerouted to the correct participant; CPU
overhead: zero, once repartitioning is done; memory overhead: none;
disk overhead: may need to spill partitions to disk if work_mem is exceeded.

4. Scatter both the inner and outer plans arbitrarily across
participants (ie uncorrelated partitioning), and build a shared hash
table. Communication overhead: synchronisation of build/probe phases,
but no tuple rerouting; CPU overhead: none; memory overhead: none;
disk overhead: may need to spill batches to disk; restrictions: none
in general, but currently we have to drop the leader after the first
batch of a multi-batch join due to our consumer/producer leader
problem mentioned in earlier messages.

We have 1. This proposal aims to provide 4. It seems we have 2 on
the way (that technique works for all 3 join algorithms without any
changes to the join operators and looks best by any measure, but is
limited by the user's schema, ie takes careful planning on the user's
part instead of potentially helping any join). Other databases
including SQL Server offer 3. I suspect that 4 is probably a better
fit than 3 for Postgres today, because the communication overhead of
shovelling nearly all tuples through extra tuple queues to route them
to the right hash table would surely be very high, though I can see
that it's very attractive to have a reusable tuple repartitioning
operator and then run k disjoint communication-free joins (again,
without code change to the join operator, and to the benefit of all
join operators).

About the shared batch reading code: this patch modifies BufFile so
that a temporary file can be shared read-only with other participants,
and then introduces a mechanism for coordinating shared reads. Each
worker starts out reading all the tuples from the file that it wrote,
before attempting to steal tuples from the files written by other
participants, until there are none left anywhere. In the best case
they all write out and then read back in just their own files with
minimal contention; contention rises as tuples are less evenly
distributed among participants. We never quite reach the best case,
though, because the leader always leaves behind a bunch of batches for
the others to deal with when it quits early. Maybe I should separate all
the batch reader stuff into another patch so it doesn't clutter the
hash join code up so much? I will start reviewing Parallel Tuplesort
shortly, which includes some related ideas.
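
For what it's worth, the reading order can be sketched like this (a
toy, not the patch's API: a plain array stands in for each
participant's BufFile, an atomic counter stands in for the shared read
position, and all names are mine):

/* Toy sketch of the shared batch reading order. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NPARTICIPANTS 3
#define NTUPLES 4

/* Each "file" is just an array of fake tuples here. */
static int files[NPARTICIPANTS][NTUPLES] = {
    {0, 1, 2, 3},
    {10, 11, 12, 13},
    {20, 21, 22, 23},
};

/* One shared read position per file, so readers never get the same tuple. */
static atomic_int cursors[NPARTICIPANTS];

/* Return the next unread tuple from a file, or -1 if it is exhausted. */
static int
next_tuple(int file)
{
    int pos = atomic_fetch_add(&cursors[file], 1);

    return pos < NTUPLES ? files[file][pos] : -1;
}

static void *
read_batch(void *arg)
{
    int me = (int) (intptr_t) arg;
    int tuple;

    /* Read back the file we wrote ourselves first: little or no contention. */
    while ((tuple = next_tuple(me)) != -1)
        printf("participant %d reads %d from its own file\n", me, tuple);

    /* Then steal whatever remains in the other participants' files. */
    for (int other = 0; other < NPARTICIPANTS; other++)
        while ((tuple = next_tuple(other)) != -1)
            printf("participant %d steals %d from file %d\n", me, tuple, other);

    return NULL;
}

int
main(void)
{
    pthread_t workers[NPARTICIPANTS];

    for (int i = 0; i < NPARTICIPANTS; i++)
        pthread_create(&workers[i], NULL, read_batch, (void *) (intptr_t) i);
    for (int i = 0; i < NPARTICIPANTS; i++)
        pthread_join(&workers[i], NULL);
    return 0;
}

When the "files" hold similar amounts of data each worker mostly
drains its own and finds little left to steal; skew the contents and
the stealing loops do more work, which is where the contention
described above comes from.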

Some assorted notes on the status: I need to do some thinking about
the file cleanup logic: both explicit deletes at the earliest possible
time, and failure/error paths. Currently the creator of each file is
responsible for cleaning it up, but I guess if the creator aborts
early the file disappears underneath the others' feet, and then they
might raise a confusing error report that races against the root cause
error report; I'm looking into that. Rescans and skew
buckets not finished yet. The new chunk-queue based
ExecScanHashTableForUnmatched isn't tested yet (it replaces an
earlier version that was doing a bucket-by-bucket parallel scan).
There are several places where I haven't changed the private hash
table code to match the shared version because I'm not sure about
that, in particular the idea of chunk-based accounting (which happens
to be convenient for this code, but I also believe it to be more
correct). I'm still trying to decide how to report the hash table
tuple count and size: possibly the grand totals. Generally I need to
do some tidying and provide a suite of queries that hits interesting
cases. I hope to move on these things fairly quickly now that I've
got the hash table resizing and batch sharing stuff working (a puzzle
that kept me very busy for a while) though I'm taking a break for a
bit to do some reviewing.

The test query I've been looking at recently is TPCH Q9. With scale
1GB and work_mem = 64KB, I get a query plan that includes three
different variants of Hash node: Hash (run in every backend, duplicate
hash tables), Shared Hash (run in just one backend, but allowed to use
the sum of work_mem of all the backends, so usually wins by avoiding
batching), and Parallel Shared Hash (run in parallel and using sum of
work_mem). As an anecdatum, I see around 2.5x speedup against master,
using only 2 workers in both cases, though it seems to be bimodal,
either 2x or 2.8x, which I expect has something to do with that leader
exit stuff and I'm looking into that. More on performance soon.

Thanks for reading!


Thomas Munro

Attachment Content-Type Size
parallel-hash-v3.patch application/octet-stream 157.2 KB
