Parallel tuplesort, partitioning, merging, and the future

From: Peter Geoghegan <pg@heroku.com>
To: Pg Hackers <pgsql-hackers@postgresql.org>
Cc: Robert Haas <robertmhaas@gmail.com>, Amit Kapila <amit.kapila16@gmail.com>
Subject: Parallel tuplesort, partitioning, merging, and the future
Date: 2016-08-08 19:44:02
Message-ID: CAM3SWZR+ATYAzyMT+hm-Bo=1L1smtJbNDtibwBTKtYqS0dYZVg@mail.gmail.com

Over on the "Parallel tuplesort (for parallel B-Tree index creation)"
thread [1], there has been some discussion of merging vs.
partitioning. There is a concern that the merge of the tuplesort used
to build a B-Tree is not itself parallelized. There is also a weak
consensus that we'd be better off with partitioning rather than
merging anyway.

I've started this new thread to discuss where we want to be with
partitioning (and further parallel merging). In short: How can we
maximize the benefit of parallel tuplesort?

I've said or at least implied that I favor keeping the CREATE INDEX
merge serial, while pursuing partitioning (for other uses) at a later
date. Further parallelization is much more useful for parallel query
than for parallel CREATE INDEX.

To summarize, I favor not parallelizing merging because:

* The scalability of parallel CREATE INDEX as implemented is
apparently comparable to the scalability of similar features in other
systems [2] (systems with very mature implementations). ISTM that we
cannot really expect to do too much better than that.

* We already do some merging in parallel to get runs for the leader to
merge in serial, if and only if workers each produce more than one
initial run (which is not unlikely). That's still pretty effective, and
ensures that the final leader merge is cache efficient. I think that
I'll probably end up changing the patch to share maintenance_work_mem
among workers (rather than have it be per-worker), making it common for
workers to merge, while the leader merge ends up needing fairly few
rounds of preloading to merge the final output. (See the rough
arithmetic sketch just after this list.)

* Even if it helps a bit, parallel merging is not the future;
partitioning is (more on this later).
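
To make the maintenance_work_mem point above concrete, here is a
trivial standalone C sketch. The numbers are invented and the
one-run-per-workMem-of-input estimate is only a crude approximation,
but it shows why sharing the budget makes worker-side merging common
while keeping the leader's final merge fan-in at just nworkers:

#include <stdio.h>

int
main(void)
{
    /* All figures are invented for illustration only. */
    long    maintenance_work_mem_mb = 1024; /* shared across all workers */
    int     nworkers = 8;
    long    per_worker_mb = maintenance_work_mem_mb / nworkers;
    long    worker_input_mb = 2048;     /* each worker's share of the input */

    /* Crudely assume one initial run per workMem-sized chunk of input. */
    long    runs_per_worker =
        (worker_input_mb + per_worker_mb - 1) / per_worker_mb;

    printf("per-worker budget: %ld MB\n", per_worker_mb);
    printf("initial runs per worker: ~%ld (merged to 1 within the worker)\n",
           runs_per_worker);
    printf("leader's final merge fan-in: %d runs\n", nworkers);
    return 0;
}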

I don't think partitioning is urgent for CREATE INDEX, and it may be
inappropriate for CREATE INDEX under any circumstances, because:

* Possible problems with parallel infrastructure and writes.

Can the parallel infrastructure even be made to write and WAL-log an
index in parallel, too? Partitioning more or less forces this, at
least if you expect any benefit at all -- reading the concatenated
output files in a serial pass that writes the index seems likely to
hurt more than it helps. And parallel writing is not much of an
advantage when you're likely to be I/O bound anyway, and when the
tuple startup cost *per worker* is not of particular concern.

* Unbalanced B-Trees (or the risk thereof).

This is the really big one for me. We can only write leaf pages out in
parallel; we have to "merge together sub B-Trees" to produce internal
pages that make up a finished B-Tree. That risks creating a set of
internal pages that reflect a skew that any partition-to-sort approach
happened to have, due to whatever problem might emerge in the choice
of separators in tuplesort.c (this is a documented issue with SQL
Server Enterprise Edition, in fact). Can we tolerate the risk of
ending up with an unbalanced final B-Tree in the event of a
misestimation? Seems like that's quite a lot worse than poor query
performance (due to poor load balancing among parallel workers). DBAs
are more or less used to the latter variety of problems. They are not
used to the former, especially because the problem might go unnoticed
for a long time.

* What I've come up with is minimally divergent from the existing
approach to tuplesorting.

This is useful because the code complexity of having multiple workers
consume final output (write an index) in parallel looks to be fairly
high. Consider B-Tree related features that tuplesort must
care about today, which have further special considerations in a world
where this simple "consume all output at once in one process" model is
replaced for parallel CREATE INDEX. You'd get all kinds of subtle
problems at various boundaries in the final keyspace for things like
CREATE UNIQUE INDEX CONCURRENTLY. That seems like something that I'd
be quite happy to avoid worrying about (while still allowing parallel
CREATE UNIQUE INDEX CONCURRENTLY). Note that some other systems don't
allow the equivalent of parallel CREATE INDEX CONCURRENTLY at all,
presumably because of similar concerns. My patch supports CIC, and
does so almost automatically (although the second pass isn't performed
in parallel).

I bet that CREATE INDEX won't be the last case where that simplicity
and generality clearly matter.

Suggested partitioning algorithm
================================

I think a hybrid partitioning + merging approach would work well for
us. The paper "Parallel Sorting on a Shared-Nothing Architecture using
Probabilistic Splitting" [3] has influenced my thinking here (this was
written by prominent researchers from the influential UW-Madison
database group). Currently, I have in mind something that is closer to
what they call exact splitting than to what they call probabilistic
splitting, because I don't think it's going to be
generally possible to have good statistics on partition boundaries
immediately available (e.g., through something like their
probabilistic splitting sampling the relation ahead of time).

The basic idea I have in mind is that we create runs in workers in the
same way that the parallel CREATE INDEX patch does (one output run per
worker). However, rather than merging in the leader, we use a
splitting algorithm to determine partition boundaries on-the-fly. The
logical tape stuff then does a series of binary searches to find those
exact split points within each worker's "final" tape. Each worker
reports the boundary points of its original materialized output run in
shared memory. Then, the leader instructs workers to "redistribute"
slices of their final runs among each other, by changing the tapeset
metadata to reflect that each worker has nworker input tapes with
redrawn offsets into a unified BufFile. Workers immediately begin
their own private on-the-fly merges.
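
To make the shape of that redistribution concrete, here is a
standalone toy in C -- this is not patch code; the hard-coded runs,
hand-picked boundary keys, and simple selection-based merge are all
stand-ins for tuplesort's logical tapes and on-the-fly heap merge.
Each "worker" binary-searches every run for its assigned key range and
then merges just those slices:

#include <stdio.h>

#define NWORKERS 3
#define RUNLEN   8

/* First index in run[0..len) whose value is >= key (lower bound). */
static int
lower_bound(const int *run, int len, int key)
{
    int     lo = 0,
            hi = len;

    while (lo < hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (run[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

int
main(void)
{
    /* One pre-sorted output run per worker, as after run generation. */
    int     runs[NWORKERS][RUNLEN] = {
        {1, 4, 9, 12, 15, 20, 23, 30},
        {2, 5, 8, 11, 19, 22, 27, 31},
        {3, 6, 7, 13, 17, 25, 28, 33}
    };
    /* Boundary keys drawn by the leader (chosen by hand here). */
    int     bounds[NWORKERS + 1] = {0, 11, 22, 100};

    /* Worker w is responsible for keys in [bounds[w], bounds[w + 1]). */
    for (int w = 0; w < NWORKERS; w++)
    {
        int     pos[NWORKERS];
        int     end[NWORKERS];

        for (int r = 0; r < NWORKERS; r++)
        {
            pos[r] = lower_bound(runs[r], RUNLEN, bounds[w]);
            end[r] = lower_bound(runs[r], RUNLEN, bounds[w + 1]);
        }

        printf("worker %d merges:", w);
        for (;;)
        {
            int     best = -1;

            for (int r = 0; r < NWORKERS; r++)
                if (pos[r] < end[r] &&
                    (best < 0 || runs[r][pos[r]] < runs[best][pos[best]]))
                    best = r;
            if (best < 0)
                break;
            printf(" %d", runs[best][pos[best]++]);
        }
        printf("\n");
    }
    return 0;
}

In the real thing, the binary searches happen against each worker's
final materialized tape, and each per-worker merge is an ordinary
on-the-fly merge over slices of the shared BufFile.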

What's really nice about this is that not long after the initial
generation of nworker runs, which has already been shown to be very
scalable, workers can each *individually* start returning output
tuples immediately (the first tuple can be returned by each on-the-fly
merge just after the drawing of boundaries). There is almost no serial
bottleneck that they block on. I think that this can be more important
than improving sort throughput per se.

There'd probably also be an interface for callers to tell tuplesorts
what their splitter boundaries are directly, and an interface to
retrieve details of what splitter boundaries a tuplesort has
determined for itself (in order to have two sort nodes with matching
boundaries, as with a parallel merge join).
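
Purely as a strawman for what that interface might look like -- these
function names and signatures are invented here, and nothing like them
exists in tuplesort.h today:

/* Hypothetical additions to utils/tuplesort.h; names are invented. */

/*
 * Impose splitter boundaries on a parallel sort, so that a second sort
 * (e.g. the other side of a parallel merge join) partitions its output
 * on exactly the same key ranges.
 */
extern void tuplesort_set_boundaries(Tuplesortstate *state,
                                     Datum *boundaries, int nboundaries);

/*
 * Report the splitter boundaries that a parallel sort determined for
 * itself, so that a caller can pass them to tuplesort_set_boundaries()
 * for a cooperating sort node.
 */
extern int  tuplesort_get_boundaries(Tuplesortstate *state,
                                     Datum **boundaries);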

Clearly it's really hard to be sure that this is the right thing at
this point, but my intuition is that this is the way to go (while
avoiding anything like this for CREATE INDEX). I'd like to know how
others feel about it.

[1] https://www.postgresql.org/message-id/flat/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
[2] https://www.postgresql.org/message-id/CA%2BTgmoY5JYs4R1g_ZJ-P6SkULSb19xx4zUh7S8LJiXonCgVTuQ%40mail.gmail.com
[3] http://pages.cs.wisc.edu/~dewitt/includes/paralleldb/parsort.pdf
--
Peter Geoghegan
