Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-03-21 16:10:17
Message-ID: CA+TgmoaumQLs3LT-Lwx2JrQUUNGF5VHnD7zG=zzUMf_2J1DfbA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 19, 2017 at 9:03 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> I attach my V9 of the patch. I came up with some stuff for the design of
>> resource management that I think meets every design goal that we have
>> for shared/unified BufFiles:
>
> Commit 2609e91fc broke the parallel CREATE INDEX cost model. I should
> now pass -1 as the index block argument to compute_parallel_worker(),
> just as all callers that aren't parallel index scan do after that
> commit. This issue caused V9 to never choose parallel CREATE INDEX
> within nbtsort.c. There was also a small amount of bitrot.
>
> Attached V10 fixes this regression. I also couldn't resist adding a
> few new assertions to buffile.c that I thought were worth having, plus
> dedicated wait events for parallel tuplesort. And I fixed a silly bug,
> introduced in V9, around where worker_wait() should occur.

Some initial review comments:

- * This code is moderately slow (~10% slower) compared to the regular
- * btree (insertion) build code on sorted or well-clustered data. On
- * random data, however, the insertion build code is unusable -- the
- * difference on a 60MB heap is a factor of 15 because the random
- * probes into the btree thrash the buffer pool. (NOTE: the above
- * "10%" estimate is probably obsolete, since it refers to an old and
- * not very good external sort implementation that used to exist in
- * this module. tuplesort.c is almost certainly faster.)

While I agree that the old comment is probably inaccurate, I don't
think dropping it without comment in a patch to implement parallel
sorting is the way to go. How about updating it to be more current as
a separate patch?

+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_BTREE_SHARED UINT64CONST(0xA000000000000001)
+#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xA000000000000002)
+#define PARALLEL_KEY_TUPLESORT_SPOOL2 UINT64CONST(0xA000000000000003)

1, 2, and 3 would probably work just as well. The parallel
infrastructure uses high-numbered values to avoid conflict with
plan_node_id values, but this is a utility statement so there's no
such problem. But it doesn't matter very much.
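For what it's worth, here's a minimal sketch of the lower-valued variant,
following the usual shm_toc pattern for a parallel utility command;
BTShared/btshared are just stand-ins for whatever shared state the patch
actually publishes:

/* Low values are fine for a utility statement; nothing competes with plan_node_id. */
#define PARALLEL_KEY_BTREE_SHARED       UINT64CONST(1)
#define PARALLEL_KEY_TUPLESORT          UINT64CONST(2)
#define PARALLEL_KEY_TUPLESORT_SPOOL2   UINT64CONST(3)

/* Leader side: reserve space for the keyed chunk while estimating ... */
shm_toc_estimate_chunk(&pcxt->estimator, sizeof(BTShared));
shm_toc_estimate_keys(&pcxt->estimator, 1);

/* ... InitializeParallelDSM(pcxt) runs between estimating and allocating ... */

/* ... then allocate the chunk and publish it under its key. */
btshared = (BTShared *) shm_toc_allocate(pcxt->toc, sizeof(BTShared));
shm_toc_insert(pcxt->toc, PARALLEL_KEY_BTREE_SHARED, btshared);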

+ * Note: caller had better already hold some type of lock on the table and
+ * index.
+ */
+int
+plan_create_index_workers(Oid tableOid, Oid indexOid)

Caller should pass down the Relation rather than the Oid. That is
better both because it avoids unnecessary work and because it more or
less automatically avoids the problem mentioned in the note.

Why is this being put in planner.c rather than somewhere specific to
creating indexes? I'm not sure that's a good idea.
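To make that concrete, here's a hedged sketch of just the interface change I
have in mind; the body is deliberately empty, and the cost-model logic stays
whatever the patch makes it:

int
plan_create_index_workers(Relation heapRel, Relation indexRel)
{
    int         parallel_workers = 0;

    /*
     * The caller's already-open, already-locked Relations are used directly,
     * so nothing needs to be reopened here, and the "caller had better
     * already hold some type of lock" note takes care of itself.
     */

    /* ... existing cost-model logic from the patch ... */

    return parallel_workers;
}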

+ * This should be called when workers have flushed out temp file buffers and
+ * yielded control to caller's process. Workers should hold open their
+ * BufFiles at least until the caller's process is able to call here and
+ * assume ownership of BufFile. The general pattern is that workers make
+ * available data from their temp files to one nominated process; there is
+ * no support for workers that want to read back data from their original
+ * BufFiles following writes performed by the caller, or any other
+ * synchronization beyond what is implied by caller contract. All
+ * communication occurs in one direction. All output is made available to
+ * caller's process exactly once by workers, following call made here at the
+ * tail end of processing.

Thomas has designed a system for sharing files among cooperating
processes that lacks several of these restrictions. With his system,
it's still necessary for all data to be written and flushed by the
writer before anybody tries to read it. But the restriction that the
worker has to hold its BufFile open until the leader can assume
ownership goes away. That's a good thing; it means workers don't have
to sit around waiting for the leader to take ownership of a resource
when they could instead exit sooner, freeing up worker slots for some
other query, or move on to some other computation. The
restriction that the worker can't reread the data after handing off
the file also goes away. The files can be read and written by any
participant in any order, as many times as you like, with only the
restriction that the caller must guarantee that data will be written
and flushed from private buffers before it can be read. I don't see
any reason to commit both his system and your system, and his is more
general so I think you should use it. That would cut hundreds of
lines from this patch with no real disadvantage that I can see --
including things like worker_wait(), which are only needed because of
the shortcomings of the underlying mechanism.
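To illustrate what I mean, here's roughly how a worker-to-leader handoff
could look under his scheme. The names below (SharedFileSet and the
BufFileCreateShared/BufFileExportShared/BufFileOpenShared calls) are my
shorthand for the interfaces in his patch set, and the file name is made up:

/* Worker: create a named file in the shared fileset, write, flush, go away. */
BufFile    *file = BufFileCreateShared(fileset, "worker_3.tape");

BufFileWrite(file, data, len);
BufFileExportShared(file);      /* flush; the file is now read-only for everyone */
BufFileClose(file);             /* worker can exit immediately, no handoff wait */

/* Leader, or any other participant, at any later point and in any order: */
file = BufFileOpenShared(fileset, "worker_3.tape");
BufFileRead(file, buf, len);
BufFileClose(file);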

+ * run. Parallel workers always use quicksort, however.

Comment fails to mention a reason.

+ elog(LOG, "%d using " INT64_FORMAT " KB of memory for read buffers among %d input tapes",
+ state->worker, state->availMem / 1024, numInputTapes);

I think "worker %d" or "participant %d" would be a lot better than
just starting the message with "%d". (There are multiple instances of
this, with various messages.)
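Concretely, I'd expect something along these lines (only the prefix changes;
the rest of the message is yours):

elog(LOG, "worker %d using " INT64_FORMAT " KB of memory for read buffers among %d input tapes",
     state->worker, state->availMem / 1024, numInputTapes);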

I think some of the smaller changes that this patch makes, like
extending the parallel context machinery to support SnapshotAny, could
be usefully broken out as separately-committable patches.

I haven't really dug down into the details here, but with the
exception of the buffile.c stuff, which I don't like, the overall
design of this seems pretty sensible to me. We might eventually want
to do something more clever at the sorting level, but those changes
would be confined to tuplesort.c, and all the other changes you've
introduced here would stand on their own. Which is to say that even
if there's more win to be had here, this is a good start.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
