
From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date: 2018-01-18 03:22:10
Message-ID: CAH2-WznBA5DE-fyuMWxo1j1=r6+mDPtP2XHeQS6QOz=6fopvTQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> While it certainly did occur to me that that was kind of weird, and I
>> struggled with it on my own for a little while, I ultimately agreed
>> with Thomas that it added something to have ltsConcatWorkerTapes()
>> call some buffile function in every iteration of its loop.
>> (BufFileView() + BufFileViewAppend() are code that Thomas actually
>> wrote, though I added the asserts and comments myself.)
>
> Hmm, well, if Thomas contributed code to this patch, then he needs to
> be listed as an author. I went searching for an email on this thread
> (or any other) where he posted code for this, thinking that there
> might be some discussion explaining the motivation, but I didn't find
> any. I'm still in favor of erasing this distinction.

I cleared this with Thomas recently, on this very thread, and got a +1
from him on not listing him as an author. Still, I have no problem
crediting Thomas as an author instead of a reviewer, even though
you're now asking me to remove what little code he actually authored.
The distinction between secondary author and reviewer is often
blurred, anyway.

Whether or not Thomas is formally a co-author is ambiguous, and not
something that I feel strongly about (there is no ambiguity about the
fact that he made a very useful contribution, though -- he certainly
did, both directly and indirectly). I already went out of my way to
ensure that Heikki receives a credit for parallel CREATE INDEX in the
v11 release notes, even though I don't think that there is any formal
rule requiring me to do so -- he *didn't* write even one line of code
in this patch. (That was just my take on another ambiguous question
about authorship.)

I suggest that we revisit this when you're just about to commit the
patch. Or you can just add his name -- I like to err on the side of
being inclusive.

>> If you think about this in terms of the interface rather than the
>> implementation, then it may make more sense. The encapsulation adds
>> something which might pay off later, such as when extendBufFile()
>> needs to work with a concatenated set of BufFiles. And even right now,
>> I cannot simply reuse the BufFile without then losing the assert that
>> is currently in BufFileViewAppend() (must not have associated shared
>> fileset assert). So I'd end up asserting less (rather than more) there
>> if BufFileView() was removed.
>
> I would see the encapsulation as having some value if the original
> BufFile remained valid and the new view were also valid. Then the
> BufFileView operation is a bit like a copy-on-write filesystem
> snapshot: you have the original, which you can do stuff with, and you
> have a copy, which can be manipulated independently, but the copying
> is cheap. But here the BufFile gobbles up the original so I don't see
> the point.

I'll see what I can do about this.

>> I think I still get the gist of what you're saying, though. I've come
>> up with a new structure that is a noticeable improvement on what I
>> had. Importantly, the new structure let me add a number of
>> parallelism-agnostic asserts that make sure that every ambuild routine
>> that supports parallelism gets the details right.
>
> Yes, that looks better. I'm slightly dubious that the new Asserts()
> are worthwhile, but I guess it's OK.

Bear in mind that the asserts basically amount to a check that the am
propagated indexInfo->ii_Concurrent correctly within workers. It's nice
to be able to do this in a way that applies equally well to the serial
case.

> But I think it would be better
> to ditch the if-statement and do it like this:
>
> Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
> Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin)
> : !TransactionIdIsValid(OldestXmin));
> Assert(snapshot == SnapshotAny || !anyvisible);
>
> Also, I think you've got a little more than you need in terms of
> comments. I would keep the comments for the serial case and parallel
> case and drop the earlier one that basically says the same thing:

Okay.

>> (ReindexIsProcessingIndex() issue with non-catalog tables)
>
> I agree that it isn't particularly likely, but if somebody found it
> worthwhile to insert guards against those cases, maybe we should
> preserve them instead of abandoning them. It shouldn't be that hard
> to propagate those values from the leader to the workers. The main
> difficulty there seems to be that we're creating the parallel context
> in nbtsort.c, while the state that would need to be propagated is
> private to index.c, but there are several ways to solve that problem.
> It looks to me like the most robust approach would be to just make
> that part of what parallel.c naturally does. Patch for that attached.

If you think it's worth the cycles, then I have no objection. I will
point out that this means that everything that I say about
ReindexIsProcessingIndex() no longer applies, because the relevant
state will now be propagated. It doesn't need to be mentioned at all,
and I don't even need to forbid builds on catalogs.

Should I go ahead and restore builds on catalogs, and remove those
comments, on the assumption that your patch will be committed before
mine? Obviously parallel index builds on catalogs don't matter. OTOH,
why not? Perhaps it's like the debate around HOT that took place over
10 years ago, where Tom insisted that HOT work with catalogs on
general principle.

>> (It might make sense to allow this if parallel_leader_participation
>> was *purely* a testing GUC, only for use by backend hackers, but
>> AFAICT it isn't.)
>
> As applied to parallel CREATE INDEX, it pretty much is just a testing
> GUC, which is why I was skeptical about leaving support for it in the
> patch. There's no anticipated advantage to having the leader not
> participate -- unlike for parallel queries, where it is quite possible
> that setting parallel_leader_participation=off could be a win, even
> generally. If you just have a Gather over a parallel sequential scan,
> it is unlikely that parallel_leader_participation=off will help; it
> will most likely hurt, at least up to the point where more
> participants become a bad idea in general due to contention.

It's unlikely to hurt much, since as you yourself said,
compute_parallel_worker() doesn't consider the leader's participation.
Actually, if we assume that compute_parallel_worker() is perfect, then
surely parallel_leader_participation=off would beat
parallel_leader_participation=on for CREATE INDEX -- it would allow us
to use the value that compute_parallel_worker() truly intended. Which
is the opposite of what you say about
parallel_leader_participation=off above.

I am only trying to understand your perspective here. I don't think
that parallel_leader_participation support is that important. I think
that parallel_leader_participation=off might be slightly useful as a
way of discouraging parallel CREATE INDEX on smaller tables, just like
it is for parallel sequential scan (though this hinges on specifically
disallowing "degenerate parallel scan" cases). More often, it will
make hardly any difference if parallel_leader_participation is on or
off.

> In other words, right now, parallel_leader_participation is not
> strictly a testing GUC, but if we make CREATE INDEX respect it, then
> we're pushing it towards being a GUC that you don't ever want to
> enable except for testing. I'm still not sure that's a very good
> idea, but if we're going to do it, then surely we should be
> consistent.

I'm confused. I *don't* want it to be something that you can only use
for testing. I want to not hurt whatever case there is for the
parallel_leader_participation GUC being something that a DBA may tune
in production. I don't see the conflict here.

> It's true that having one worker and no parallel leader
> participation can never be better than just having the leader do it,
> but it is also true that having two leaders and no parallel leader
> participation can never be better than having 1 worker with leader
> participation. I don't see a reason to treat those cases differently.

You must mean "having two workers and no parallel leader participation...".

The reason to treat those two cases differently is simple: One
couldn't possibly be desirable in production, and undermines the whole
idea of parallel_leader_participation being user visible by adding a
sharp edge. The other is likely to be pretty harmless, especially
because leader participation is generally pretty fudged, and our cost
model is fairly rough. The difference here isn't what is important;
avoiding doing something that we know couldn't possibly help under any
circumstances is important. I think that we should do that on general
principle.

As I said in a prior e-mail, even parallel query's use of
parallel_leader_participation is consistent with what I propose here,
practically speaking, because a partial path without leader
participation will always lose to a serial sequential scan path in
practice. The fact that the optimizer will create a partial path that
makes a useless "degenerate parallel scan" a *theoretical* possibility
is irrelevant, because the optimizer has its own way of making sure
that such a plan doesn't actually get picked. It has its way, and so I
must have my own.

> If we're going to keep parallel_leader_participation support here, I
> think the last hunk in config.sgml should read more like this:
>
> Allows the leader process to execute the query plan under
> <literal>Gather</literal> and <literal>Gather Merge</literal> nodes
> and to participate in parallel index builds. The default is
> <literal>on</literal>. For queries, setting this value to
> <literal>off</literal> reduces the likelihood that workers will become
> blocked because the leader is not reading tuples fast enough, but
> requires the leader process to wait for worker processes to start up
> before the first tuples can be produced. The degree to which the
> leader can help or hinder performance depends on the plan type or
> index build strategy, number of workers and query duration. For index
> builds, setting this value to <literal>off</literal> is expected to
> reduce performance, but may be useful for testing purposes.

Why is CREATE INDEX really that different in terms of the downside for
production DBAs? I think it's more accurate to say that it's not
expected to improve performance. What do you think?

>> I suspect that the parameters of any cost model for parallel CREATE
>> INDEX that we're prepared to consider for v11 are: "Use a number of
>> parallel workers that is one below the number at which the total
>> duration of the CREATE INDEX either stays the same or goes up".
>
> That's pretty much the definition of a correct cost model; the trick
> is how to implement it without an oracle.

Correct on its own terms, at least. What I meant to convey here is
that there is little scope to do better in v11 on distributed costs
for the system as a whole, and therefore little scope to improve the
cost model.

>> BTW, the 32MB per participant limit within plan_create_index_workers()
>> was chosen based on the realization that any higher value would make
>> having a default setting of 2 for max_parallel_maintenance_workers (to
>> match the max_parallel_workers_per_gather default) pointless when the
>> default maintenance_work_mem value of 64MB is in use.

> I see. I think it's a good start. I wonder in general whether it's
> better to add memory or add workers. In other words, suppose I have a
> busy system where my index builds are slow. Should I try to free up
> some memory so that I can raise maintenance_work_mem, or should I try
> to free up some CPU resources so I can raise
> max_parallel_maintenance_workers?

This is actually all about distributed costs, I think. Provided you
have a reasonably sympathetic index build, like say a random numeric
column index build, and the index won't be multiple gigabytes in size,
then 1MB of maintenance_work_mem still seems to win with parallelism.
This seems extremely "selfish", though -- that's going to incur a lot
of random I/O for an operation that is typically characterized by
sequential I/O. Plus, I bet you're using quite a bit more memory than
1MB, in the form of FS cache. It seems hard to lose if you don't care
about distributed costs, especially if it's a matter of using 1 or 2
parallel workers versus just doing a serial build. Granted, you
describe a 1MB maintenance_work_mem case below where parallelism loses,
which seems to contradict my suggestion that you practically cannot
lose with parallelism. However, ISTM that you really went out of your
way to find a case that lost.

Of course, I'm not arguing that it's okay for parallel CREATE INDEX to
be selfish -- it isn't. I'm prepared to say that you shouldn't use
parallelism if you have 1MB of maintenance_work_mem, no matter how
much it seems to help (and though it might sound crazy, because it is,
it *can* help). I'm just surprised that you've not said a lot more
about distributed costs, because that's where all the potential
benefit seems to be. It happens to be an area that we have no history
of modelling in any way, which makes it hard, but that's the situation
we seem to be in.

> I also wonder what the next steps would be to make this whole thing
> scale better. From the performance tests that have been performed so
> far, it seems like adding a modest number of workers definitely helps,
> but it tops out around 2-3x with 4-8 workers. I understand from your
> previous comments that's typical of other databases.

Yes. This patch seems to have scalability that is very similar to the
scalability that you get with similar features in other systems. I
have not verified this through first hand experience/experiments,
because I don't have access to that stuff. But I have found numerous
reports related to more than one other system. I don't think that this
is the only goal that matters, but I do think that it's an interesting
data point.

> It also seems
> pretty clear that more memory helps but only to a point. For
> instance, I just tried "create index x on pgbench_accounts (aid)"
> without your patch at scale factor 1000.

Again, I discourage everyone from giving too much weight to index
builds like this one. This does not involve very much sorting at all,
because everything is already in order, and the comparisons are cheap
int4 comparisons. It may already be very I/O bound before you start to
use parallelism.

> With maintenance_work_mem =
> 1MB, it generated 6689 runs and took 131 seconds. With
> maintenance_work_mem = 64MB, it took 67 seconds. With
> maintenance_work_mem = 1GB, it took 60 seconds. More memory didn't
> help, even if the sort could be made entirely internal. This seems to
> be a fairly typical pattern: using enough memory can buy you a small
> multiple, using a bunch of workers can buy you a small multiple, but
> then it just doesn't get faster.

Adding memory is just as likely to hurt slightly as help slightly,
especially if you're talking about CREATE INDEX, where being able to
use a final on-the-fly merge is a big deal (you can hide the cost of
the merging by doing it when you're very I/O bound anyway). This
should be true with only modest assumptions: I assume that you're in
one pass territory here, and that you have a reasonably small merge
heap (with perhaps no more than 100 runs). This seems likely to be
true the vast majority of the time with CREATE INDEX, assuming the
system is reasonably well configured. Roughly speaking, once my
assumptions are met, the exact number of runs almost doesn't matter
(that's at least useful as a mental model).

I basically disagree with the statement "using enough memory can buy
you a small multiple", since it's only true when you started out using
an unreasonably small amount of memory. Bear in mind that every time
maintenance_work_mem is doubled, our capacity to do sorts in one pass
quadruples. Using 1MB of maintenance_work_mem just doesn't make sense
*economically*, unless, perhaps, you care about neither the duration
of the CREATE INDEX statement, nor your electricity bill. You cannot
extrapolate anything useful from an index build that uses only 1MB of
maintenance_work_mem for all kinds of reasons.

I suggest taking another look at Prabhat's results. Here are my
observations about them:

* For serial sorts, a person reading his results could be forgiven for
thinking that increasing the amount of memory for a sort makes it go
*slower*, at least by a bit.

* Sometimes that doesn't happen for serial sorts, and sometimes it
does happen for parallel sorts, but mostly it hurts serial sorts and
helps parallel sorts, since Prabhat didn't start with an unreasonably
low amount of maintenance_work_mem.

* All the indexes are built against the same table. For the serial
cases, among each index that was built, the longest build took about
6x more time than the shortest. For parallel builds, it's more like a
3x difference. The difference gets smaller when you eliminate cases
that actually have to do almost no sorting. This "3x vs. 6x"
difference matters a lot.

This suggests to me that parallel CREATE INDEX has proven itself as
something that can take a mostly CPU bound index build, and make it
into a mostly I/O bound index build. It also suggests that we can make
better use of memory with parallel CREATE INDEX only because workers
will still need to get a reasonable amount of memory. You definitely
don't want multiple passes in workers, but for the same reasons that
you don't want them in serial cases.

> Yet, in theory, it seems like if
> we're willing to provide essentially unlimited memory and CPU
> resources, we ought to be able to make this go almost arbitrarily
> fast.

The main reason that the scalability of CREATE INDEX has trouble
getting past about 3.5x in cases we've seen doesn't involve any
scalability theory: we're very much I/O bound during the merge,
because we have to actually write out the index, regardless of what
tuplesort does or doesn't do. I've seen over 4x improvements on
systems that have sufficient temp file sequential I/O bandwidth, and
reasonably sympathetic data distributions/types.

>> WFM. Also added documentation for the wait events to monitoring.sgml,
>> which I somehow missed the first time around.
>
> But you forgot to update the preceding "morerows" line, so the
> formatting will be all messed up.

Fixed.

>> I removed "really". The point of the comment is that we've already set
>> up temp tablespaces for the shared fileset in the parallel case.
>> Shared filesets figure out which tablespaces will be used up-front --
>> see SharedFileSetInit().
>
> So why not say it that way? i.e. For parallel sorts, this should have
> been done already, but it doesn't matter if it gets done twice.

Okay.

> I don't see any reason not to make those contingent only on
> trace_sort. The user can puzzle apart which messages are which from
> the PIDs in the logfile.

Okay. I have removed anything that restrains the verbosity of
trace_sort for the WORKER() case. I still suspect that you had it
right the first time, and that this goes a bit too far, but I'm
letting it go.

--
Peter Geoghegan
