Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-01-03 23:53:59
Lists: pgsql-hackers

On Tue, Dec 20, 2016 at 5:14 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> Imagine a data structure that is stored in dynamic shared memory and
>> contains space for a filename, a reference count, and a mutex. Let's
>> call this thing a SharedTemporaryFile or something like that. It
>> offers these APIs:
>> extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
>> extern void SharedTemporaryFileAttach(SharedTemporaryFile *, dsm_segment *seg);
>> extern void SharedTemporaryFileAssign(SharedTemporaryFile *, char *pathname);
>> extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);
> I'm a little bit tired right now, and I have yet to look at Thomas'
> parallel hash join patch in any detail. I'm interested in what you
> have to say here, but I think that I need to learn more about its
> requirements in order to have an informed opinion.

Attached is V7 of the patch. The overall emphasis in this revision is
on clarifying how much can be accomplished using generalized
infrastructure, on explaining the unification mechanism coherently,
and on related issues.

Notable changes

* Rebased to work with the newly simplified logtape.c representation
(Heikki's recent removal of "indirect blocks"). Heikki's work
significantly simplified the whole unification mechanism; I think this
revision's logtape.c changes shrank by over 50% in lines of code.

* randomAccess cases are now able to reclaim disk space from blocks
originally written by workers. This further simplifies logtape.c
changes significantly. I don't think that this is important because
some future randomAccess caller might otherwise have double the
storage overhead for their parallel sort, or even because of the
disproportionate performance penalty such a caller would experience;
rather, it's important because it removes previous special cases (that
were internal to logtape.c).

For example, aside from the fact that worker tapes within a unified
tapeset will often have a non-zero offset, there is no state that
actually remembers that this is a unified tapeset, because that isn't
needed anymore. And, even though we reclaim blocks from workers, we
only have one central chokepoint for applying worker offsets in the
leader (that chokepoint is ltsReadFillBuffer()). Routines tasked with
things like positional seeking for mark/restore for certain tuplesort
clients (which are, in general, poorly tested) now need to have no
knowledge of unification while still working just the same. This is a
consequence of the fact that ltsWriteBlock() callers (and
ltsWriteBlock() itself) never have to think about offsets. I'm pretty
happy about that.

* pg_restore now prevents the planner from deciding that parallelism
should be used, in order to make restoration behavior more consistent
and predictable. Iff a dump being restored happens to have a CREATE
INDEX with the new index storage parameter parallel_workers set, then
pg_restore will use parallel CREATE INDEX. This is accomplished with a
new GUC, enable_parallelddl (since "max_parallel_workers_maintenance =
0" will disable parallel CREATE INDEX across the board, ISTM that a
second new GUC is required). I think that this behavior is the right
trade-off as far as pg_restore goes, although I still don't feel
particularly strongly about it. There is now a concrete proposal on
what to do about pg_restore, if nothing else. To recap, the general
concern addressed here is that there are typically no ANALYZE stats
available for the planner to decide with when pg_restore runs CREATE
INDEX (although that isn't always true).

* Addresses the problem of anonymous record types and their need for
"remapping" across parallel workers. I've simply pushed the
responsibility on callers within tuplesort.h contract; parallel CREATE
INDEX callers don't need to care about this, as explained there.
(CLUSTER tuplesorts would also be safe.)

* Puts the whole rationale for unification into one large comment
above the function BufFileUnify(), and removes traces of the same kind
of discussion from everywhere else. I think that buffile.c is the
right central place to discuss the unification mechanism, now that
logtape.c has been greatly simplified. All the fd.c changes are in
routines that are only ever called by buffile.c anyway, and are not
too complicated (in general, temp fd.c files are only ever owned
transitively, through BufFiles). So, morally, the unification
mechanism is something that wholly belongs to buffile.c, since
unification is all about temp files, and buffile.h is the interface
through which temp files are owned and accessed in general.

Unification remains specialized

On the one hand, BufFileUnify() now describes the whole idea of
unification in detail, in its own general terms, including its
performance characteristics, but on the other hand it doesn't pretend
to be more general than it is (that's why we really have to talk about
performance characteristics). It doesn't go as far as admitting to
being the thing that logtape.c uses for parallel sort, but even that
doesn't seem totally unreasonable to me. I think that BufFileUnify()
might also end up being used by tuplestore.c, so it isn't entirely
non-general, but I now realize that it's unlikely to be used by
parallel hash join. So, while randomAccess reclamation of worker
blocks within the leader now occurs, I have not followed Robert's
suggestion in full. For example, I didn't do this: "ltsGetFreeBlock()
need to be re-engineered to handle concurrency with other backends".
The more I've thought about it, the more appropriate the kind of
specialization I've come up with seems. I've concluded:

- Sorting is important, and therefore worth adding non-general
infrastructure in support of. It's important enough to have its own
logtape.c module, so why not this? Much of buffile.c was explicitly
written with sorting and hashing in mind from the beginning. We use
BufFiles for other things, but those two things are by far the two
most important users of temp files, and the only really compelling
candidates for parallelization.

- There are limited opportunities to share BufFile infrastructure for
parallel sorting and parallel hashing. Hashing is inverse to sorting
conceptually, so it should not be surprising that this is the case. By
that I mean that hashing is characterized by logical division and
physical combination, whereas sorting is characterized by physical
division and logical combination. Parallel tuplesort naturally allows
each worker to do an enormous amount of work with whatever data it is
fed by the parallel heap scan that it joins, *long* before the data
needs to be combined with data from other workers in any way.

Consider this code from Thomas' parallel hash join patch:

> +bool
> +ExecHashCheckForEarlyExit(HashJoinTable hashtable)
> +{
> + /*
> + * The golden rule of leader deadlock avoidance: since leader processes
> + * have two separate roles, namely reading from worker queues AND executing
> + * the same plan as workers, we must never allow a leader to wait for
> + * workers if there is any possibility those workers have emitted tuples.
> + * Otherwise we could get into a situation where a worker fills up its
> + * output tuple queue and begins waiting for the leader to read, while
> + * the leader is busy waiting for the worker.
> + *
> + * Parallel hash joins with shared tables are inherently susceptible to
> + * such deadlocks because there are points at which all participants must
> + * wait (you can't start check for unmatched tuples in the hash table until
> + * probing has completed in all workers, etc).

Parallel sort will never have to do anything like this. There is
minimal IPC before the leader's merge, and the dependencies between
phases are extremely simple (there is only one: workers need to finish
before the leader can merge, and must stick around in a quiescent
state throughout). Data throughput is what tuplesort cares about; it doesn't
really care about latency. Whereas, I gather that there needs to be
continual gossip between hash join workers (those building a hash
table) about the number of batches. They don't have to be in perfect
lockstep, but they need to cooperate closely; the IPC is pretty eager,
and therefore latency sensitive. Thomas makes use of atomic ops in his
patch, which makes sense, but I'd never bother with anything like that
for parallel tuplesort; there'd be no measurable benefit there.

In general, it's not obvious to me that the SharedTemporaryFile() API
that Robert sketched recently (or any very general shared file
interface that does things like buffer writes in shared memory, uses a
shared read pointer, etc) is right for either parallel hash join or
parallel sort. I don't see that there is much to be said for a
reference count mechanism for parallel sort BufFiles, since the
dependencies are so simple and fixed, and for hash join, a much
tighter mechanism seems desirable. I can't think why Thomas would want
a shared read pointer, since the way he builds the shared hash table
leaves it immutable once probing is underway; ISTM that he'll want
that kind of mechanism to operate at a higher level, in a more
specialized way.

That said, I don't actually know what Thomas has in mind for
multi-batch parallel hash joins, since that's only a TODO item in the
most recent revision of his patch (maybe I missed something he wrote
on this topic, though). Thomas is working on a revision that resolves
that open item, at which point we'll know more. I understand that a
new revision of his patch that closes out the TODO item isn't too far
from being posted.

Peter Geoghegan

Attachment Content-Type Size
0002-Add-temporary-testing-tools.patch.gz application/x-gzip 4.7 KB
0001-Add-parallel-B-tree-index-build-sorting.patch.gz application/x-gzip 57.8 KB
