Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date: 2018-01-16 00:54:40
Message-ID: CAH2-Wz=rRMRGTwNZ0qAn7jwZ5pp0HDFsJ5WdKpr+5gS=0FY03w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Fri, Jan 12, 2018 at 10:28 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> More comments:

Attached patch has all open issues worked through, including those
that I respond to or comment on below, as well as the other feedback
from your previous e-mails. Note also that I fixed the issue that Amit
raised, as well as the GetOldestXmin()-argument bug that I noticed in
passing when responding to Amit. I also worked on the attribution in
the commit message.

Before getting to my responses to your most recent round of feedback,
I want to first talk about some refactoring that I decided to do. As
you can see from the master branch, tuplesort_performsort() isn't
necessarily reached for spool2, even when we start out with a spool2
(that is, for many unique index builds, spool2 never even does a
tuplesort_performsort()). We may instead decide to shut down spool2
when it has no (dead) tuples. I made this work just as well for the
parallel case in this latest revision. I had to teach tuplesort.c to
accept an early tuplesort_end() for LEADER() -- it had to be prepared
to release still-waiting workers in some cases, rather than depending
on nbtsort.c having called tuplesort_performsort() already. Several
routines within nbtsort.c that previously knew something about
parallelism now know nothing about it. This seems like a nice win.

Separately, I took advantage of the fact that within the leader, its
*worker* Tuplesortstate can safely call tuplesort_end() before the
leader state's tuplesort_performsort() call.

The overall effect of these two changes is that there is now a
_bt_leader_heapscan() call for the parallel case that nicely mirrors
the serial case's IndexBuildHeapScan() call, and once we're done with
populating spools, no subsequent code needs to know a single thing
about parallelism as a special case. You may notice some small changes
to the tuplesort.h overview, which now advertises that callers can
take advantage of this leeway.

Now on to my responses to your most recent round of feeback...

> BufFileView() looks fairly pointless. It basically creates a copy of
> the input and, in so doing, destroys the input, which is a lot like
> returning the input parameter except that it uses more cycles. It
> does do a few things.

While it certainly did occur to me that that was kind of weird, and I
struggled with it on my own for a little while, I ultimately agreed
with Thomas that it added something to have ltsConcatWorkerTapes()
call some buffile function in every iteration of its loop.
(BufFileView() + BufFileViewAppend() are code that Thomas actually
wrote, though I added the asserts and comments myself.)

If you think about this in terms of the interface rather than the
implementation, then it may make more sense. The encapsulation adds
something which might pay off later, such as when extendBufFile()
needs to work with a concatenated set of BufFiles. And even right now,
I cannot simply reuse the BufFile without then losing the assert that
is currently in BufFileViewAppend() (must not have associated shared
fileset assert). So I'd end up asserting less (rather than more) there
if BufFileView() was removed.

It wastes some cycles to not simply use the BufFile directly, but not
terribly many in the grand scheme of things. This happens once per
external sort operation.

> In miscadmin.h, I'd put the prototype for the new GUC next to
> max_worker_processes, not maintenance_work_mem.

But then I'd really have to put it next to max_worker_processes in
globals.c, too. That would mean that it would go under "Primary
determinants of sizes of shared-memory structures" within globals.c,
which seems wrong to me. What do you think?

> The ereport() in index_build will, I think, confuse people when it
> says that there are 0 parallel workers. I suggest splitting this into
> two cases: if (indexInfo->ii_ParallelWorkers == 0) ereport(...
> "building index \"%s\" on table \"%s\" serially" ...) else ereport(...
> "building index \"%s\" on table \"%s\" in parallel with request for %d
> parallel workers" ...).

WFM. I've simply dropped any reference to leader participation in the
messages here, to keep things simple. This seemed okay because the
only thing that affects leader participation is the
parallel_leader_participation GUC, which is under the user's direct
control at all times, and is unlikely to be changed. Those that really
want further detail have trace_sort for that.

> The logic in IndexBuildHeapRangeScan() around need_register_snapshot
> and OldestXmin seems convoluted and not very well-edited to me.

Having revisited it, I now agree that the code added to
IndexBuildHeapRangeScan() was unclear, primarily in that the
need_unregister_snapshot local variable was overloaded in a weird way.

> I suggest that you go back to the original code organization
> and then just insert an additional case for a caller-supplied scan, so
> that the overall flow looks like this:
>
> if (scan != NULL)
> {
> ...
> }
> else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
> {
> ...
> }
> else
> {
> ...
> }

The problem that I see with this alternative flow is that the "if
(scan != NULL)" and the "else if (IsBootstrapProcessingMode() ||
indexInfo->ii_Concurrent)" blocks clearly must contain code for two
distinct, non-overlapping cases, despite the fact those two cases
actually do overlap somewhat. That is, a call to
IndexBuildHeapRangeScan() may have a (parallel) heap scan argument
(control reaches your first code block), or may not (control reaches
your second or third code block). At the same time, a call to
IndexBuildHeapRangeScan() may use SnapShotAny (ordinary CREATE INDEX),
or may need an MVCC snapshot (either by registering its own, or using
the parallel one). These two things are orthogonal.

I think I still get the gist of what you're saying, though. I've come
up with a new structure that is a noticeable improvement on what I
had. Importantly, the new structure let me add a number of
parallelism-agnostic asserts that make sure that every ambuild routine
that supports parallelism gets the details right.

> Along with that, I'd change the name of need_register_snapshot to
> need_unregister_snapshot (it's doing both jobs right now) and
> initialize it to false.

Done.

> + * This support code isn't reliable when called from within a parallel
> + * worker process due to the fact that our state isn't propagated. This is
> + * why parallel index builds are disallowed on catalogs. It is possible
> + * that we'll fail to catch an attempted use of a user index undergoing
> + * reindexing due the non-propagation of this state to workers, which is not
> + * ideal, but the problem is not particularly likely to go undetected due to
> + * our not doing better there.
>
> I understand the first two sentences, but I have no idea what the
> third one means, especially the part that says "not particularly
> likely to go undetected due to our not doing better there". It sounds
> scary that something bad is only "not particularly likely to go
> undetected"; don't we need to detect bad things reliably?

The primary point here, that you said you understood, is that we
definitely need to detect when we're reindexing a catalog index within
the backend, so that systable_beginscan() can do the right thing and
not use the index (we also must avoid assertion failures). My solution
to that problem is, of course, to not allow the use of parallel create
index when REINDEXing a system catalog. That seems 100% fine to me.

There is a little bit of ambiguity about other cases, though -- that's
the secondary point I tried to make within that comment block, and the
part that you took issue with. To put this secondary point another
way: It's possible that we'd fail to detect it if someone's comparator
went bananas and decided it was okay to do SQL access (that resulted
in an index scan of the index undergoing reindex). That does seem
rather unlikely, but I felt it necessary to say something like this
because ReindexIsProcessingIndex() isn't already something that only
deals with catalog indexes -- it works with all indexes.

Anyway, I reworded this. I hope that what I came up with is clearer than before.

> But also,
> you used the word "not" three times and also the prefix "un-", meaning
> "not", once. Four negations in 13 words! Perhaps I'm not entirely in
> a position to cast aspersions on overly-complex phraseology -- the pot
> calling the kettle black and all that -- but I bet that will be a lot
> clearer if you reduce the number of negations to either 0 or 1.

You're not wrong. Simplified.

> The comment change in standard_planner() doesn't look helpful to me;
> I'd leave it out.

Okay.

> + * tableOid is the table that index is to be built on. indexOid is the OID
> + * of a index to be created or reindexed (which must be a btree index).
>
> I'd rewrite that first sentence to end "the table on which the index
> is to be built". The second sentence should say "an index" rather
> than "a index".

Okay.

> But, actually, I think we would be better off just ripping
> leaderWorker/leaderParticipates out of this function altogether.
> compute_parallel_worker() is not really under any illusion that it's
> computing a number of *participants*; it's just computing a number of
> *workers*.

That distinction does seem to cause plenty of confusion. While I
accept what you say about compute_parallel_worker(), I still haven't
gone as far as removing the leaderParticipates argument altogether,
because compute_parallel_worker() isn't the only thing that matters
here. (More on that below.)

> I think it's fine for
> parallel_leader_participation=off to simply mean that you get one
> fewer participants. That's actually what would happen with parallel
> query, too. Parallel query would consider
> parallel_leader_participation later, in get_parallel_divisor(), when
> working out the cost of one path vs. another, but it doesn't use it to
> choose the number of workers. So it seems to me that getting rid of
> all of the workerLeader considerations will make it both simpler and
> more consistent with what we do for queries.

I was aware of those details, and figured that parallel query fudges
the compute_parallel_worker() figure's leader participation in some
sense, and that that was what I needed to compensate for. After all,
when parallel_leader_participation=off, having
compute_parallel_worker() return 1 means rather a different thing to
what it means with parallel_leader_participation=on, even though in
general we seem to assume that parallel_leader_participation can only
make a small difference overall.

Here's what I've done based on your feedback: I've changed the header
comments, but stopped leaderParticipates from affecting the
compute_parallel_worker() calculation (so, as I said,
leaderParticipates stays). The leaderParticipates argument continues
to affect these two aspects of plan_create_index_workers()'s return
value:

1. It continues to be used so we have a total number of participants
(not workers) to apply our must-have-32MB-workMem limit on
participants.

Parallel query has no equivalent of this, and it seems warranted. Note
that this limit is no longer applied when parallel_workers storage
param was set, as discussed.

2. I continue to use the leaderParticipates argument to disallow the
case where there is only one CREATE INDEX participant but parallelism
is in use, because, of course, that clearly makes no sense -- we
should just use a serial sort instead.

(It might make sense to allow this if parallel_leader_participation
was *purely* a testing GUC, only for use by by backend hackers, but
AFAICT it isn't.)

The planner can allow a single participant parallel sequential scan
path to be created without worrying about the fact that that doesn't
make much sense, because a plan with only one parallel participant is
always going to cost more than some serial plan (you will only get a 1
participant parallel sequential scan when force_parallel_mode is on).
Obviously plan_create_index_workers() doesn't generate (partial) paths
at all, so I simply have to get the same outcome (avoiding a senseless
1 participant parallel operation) some other way here.

> If you have an idea how to make a better
> cost model than this for CREATE INDEX, I'm willing to consider other
> options. If you don't, or want to propose that as a follow-up patch,
> then I think it's OK to use what you've got here for starters. I just
> don't want it to be more baroque than necessary.

I suspect that the parameters of any cost model for parallel CREATE
INDEX that we're prepared to consider for v11 are: "Use a number of
parallel workers that is one below the number at which the total
duration of the CREATE INDEX either stays the same or goes up".

It's hard to do much better than this within those parameters. I can
see a fairly noticeable benefit to parallelism with 4 parallel workers
and a measly 1MB of maintenance_work_mem (when parallelism is forced)
relative to the serial case with the same amount of memory. At least
on my laptop, it seems to be rather hard to lose relative to a serial
sort when using parallel CREATE INDEX (to be fair, I'm probably
actually using way more memory than 1MB to do this due to FS cache
usage). I can think of a cleverer approach to costing parallel CREATE
INDEX, but it's only cleverer by weighing distributed costs. Not very
relevant, for the time being.

BTW, the 32MB per participant limit within plan_create_index_workers()
was chosen based on the realization that any higher value would make
having a default setting of 2 for max_parallel_maintenance_workers (to
match the max_parallel_workers_per_gather default) pointless when the
default maintenance_work_mem value of 64MB is in use. That's not
terribly scientific, though it at least doesn't come at the expense of
a more scientific idea for a limit like that (I don't actually have
one, you see). I am merely trying to avoid being *gratuitously*
wasteful of shared resources that are difficult to accurately cost in
(e.g., the distributed cost of random I/O to the system as a whole
when we do a parallel index build while ridiculously low on
maintenance_work_mem).

> I think that the naming of the wait events could be improved. Right
> now, they are named by which kind of process does the waiting, but it
> really should be named based on what the thing for which we're
> waiting. I also suggest that we could just write Sort instead of
> Tuplesort. In short, I suggest ParallelTuplesortLeader ->
> ParallelSortWorkersDone and ParallelTuplesortLeader ->
> ParallelSortTapeHandover.

WFM. Also added documentation for the wait events to monitoring.sgml,
which I somehow missed the first time around.

> Not for this patch, but I wonder if it might be a worthwhile future
> optimization to allow workers to return multiple tapes to the leader.
> One doesn't want to go crazy with this, of course. If the worker
> returns 100 tapes, then the leader might get stuck doing multiple
> merge passes, which would be a foolish way to divide up the labor, and
> even if that doesn't happen, Amdahl's law argues for minimizing the
> amount of work that is not done in parallel. Still, what if a worker
> (perhaps after merging) ends up with 2 or 3 tapes? Is it really worth
> merging them so that the leader can do a 5-way merge instead of a
> 15-way merge?

I did think about this myself, or rather I thought specifically about
building a serial/bigserial PK during pg_restore, a case that must be
very common. The worker merges for such an index build will typically
be *completely pointless* when all input runs are in sorted order,
because the merge heap will only need to consult the root of the heap
and its two immediate children throughout (commit 24598337c helped
cases like this enormously). You might as well merge hundreds of runs
in the leader, provided you still have enough memory per tape that you
can get the full benefit of OS readahead (this is not that hard when
you're only going to repeatedly read from the same tape anyway).

I'm not too worried about it, though. The overall picture is still
very positive even in this case. The "extra worker merging" isn't
generally a big proportion of the overall cost, especially there. More
importantly, if I tried to do better, it would be the "quicksort with
spillover" cost model story all over again (remember how tedious that
was?). How hard are we prepared to work to ensure that we get it right
when it comes to skipping worker merging, given that users always pay
some overhead, even when that doesn't happen?

Note also that parallel index builds manage to unfairly *gain*
advantage over serial cases (they have the good variety of dumb luck,
rather than the bad variety) in certain other common cases. This
happens with an *inverse* physical/logical correlation (e.g. a DESC
index builds on a date field). They manage to artificially do better
than theory would predict, simply because a greater number of smaller
quicksorts are much faster during initial run generation, without also
taking a concomitant performance hit at merge time. Thomas showed this
at one point. Note that even that's only true because of the qsort
precheck (what I like to call the "banana skin prone" precheck, that
we added to our qsort implementation in 2006) -- it would be true for
*all* correlations, but that one precheck thing complicates matters.

All of this is a tricky business, and that isn't going to get any easier IMV.

> + * Make sure that the temp file(s) underlying the tape set are created in
> + * suitable temp tablespaces. This is only really needed for serial
> + * sorts.
>
> This comment makes me wonder whether it is "sorta" needed for parallel sorts.

I removed "really". The point of the comment is that we've already set
up temp tablespaces for the shared fileset in the parallel case.
Shared filesets figure out which tablespaces will be used up-front --
see SharedFileSetInit().

> - if (trace_sort)
> + if (trace_sort && !WORKER(state))
>
> I have a feeling we still want to get this output even from workers,
> but maybe I'm missing something.

I updated tuplesort_end() so that trace_sort reports on the end of the
sort, even for worker processes. (We still don't show generic
tuplesort_begin* message for workers, though.)

> + arg5 indicates serial, parallel worker, or parallel leader sort.</entry>
>
> I think it should say what values are used for each case.

I based this on "arg0 indicates heap, index or datum sort", where it's
implied that the values are respective to the order that they appear
in in the sentence (starting from 0). But okay, I'll do it that way
all the same.

> + /* Release worker tuplesorts within leader process as soon as possible */
>
> IIUC, the worker tuplesorts aren't really holding onto much of
> anything in terms of resources. I think it might be better to phrase
> this as /* The sort we just did absorbed the final tapes produced by
> these tuplesorts, which are of no further use. */ or words to that
> effect.

Okay. Done that way.

> Instead of making a special case in CreateParallelContext for
> serializable_okay, maybe index_build should just use SetConfigOption()
> to force the isolation level to READ COMMITTED right after it does
> NewGUCNestLevel(). The change would only be temporary because the
> subsequent call to AtEOXact_GUC() will revert it.

I tried doing it that way, but it doesn't seem workable:

postgres=# begin transaction isolation level serializable ;
BEGIN
postgres=*# reindex index test_unique;
ERROR: 25001: SET TRANSACTION ISOLATION LEVEL must be called before any query
LOCATION: call_string_check_hook, guc.c:9953

Note that AutoVacLauncherMain() uses SetConfigOption() to set/modify
default_transaction_isolation -- not transaction_isolation.

Instead, I added a bit more to comments within
CreateParallelContext(), to justify what I've done along the lines you
went into. Hopefully this works better for you.

> There is *still* more to review here, but my concentration is fading.
> If you could post an updated patch after adjusting for the comments
> above, I think that would be helpful. I'm not totally out of things
> to review that I haven't already looked over once, but I think I'm
> close.

I'm impressed with how quickly you're getting through review of the
patch. Hopefully we can keep that momentum up.

Thanks
--
Peter Geoghegan

Attachment Content-Type Size
0001-Add-parallel-B-tree-index-build-sorting.patch text/x-patch 156.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2018-01-16 01:40:43 Re: [HACKERS] Replication status in logical replication
Previous Message Andrew Dunstan 2018-01-16 00:24:06 Re: jsonpath