Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From: Prabhat Sahu <prabhat(dot)sahu(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date: 2018-01-16 09:17:32
Message-ID: CANEvxPreActgvvORjq=wvFOLz-fhS9-fuqycPXm-X+47KJ8vOQ@mail.gmail.com
Lists: pgsql-hackers

Hi all,

I have been continuing to test the parallel CREATE INDEX patch. So far I
haven't come across any issues or regressions with the patches. Here are a
few performance numbers from the latest round of testing, which was
performed on top of the 6th Jan patch submitted by Peter.

Testing was done on an OpenStack instance with:

CPU: 8
RAM: 16GB
HD: 640 GB

postgres=# select pg_size_pretty(pg_total_relation_size('lineitem'));
pg_size_pretty
----------------
93 GB
(1 row)

-- Test 1.
max_parallel_workers_maintenance = 2
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 1GB
max_wal_size = 4GB

-- Test 2.
max_parallel_workers_maintenance = 4
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 2GB
max_wal_size = 4GB

-- Test 3.
max_parallel_workers_maintenance = 8
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 4GB
max_wal_size = 4GB

NOTE: All time-taken entries are the median of 3 consecutive runs of the
same B-tree index creation query.
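
Each timing was captured roughly as follows (a sketch of one run, using
Test 1's settings; I'm showing session-level SETs here, though the same
values could equally be set in postgresql.conf):

SET max_parallel_workers_maintenance = 2;
SET maintenance_work_mem = '1GB';
\timing on
CREATE INDEX li_ordkey_idx1 ON lineitem(l_orderkey);
-- drop, repeat twice more, and take the median of the 3 timings:
DROP INDEX li_ordkey_idx1;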

Time taken for parallel index creation:
(Test 1, Test 2 and Test 3 use max_parallel_workers_maintenance = 2, 4 and
8 respectively, per the configurations above. "% Change" is the improvement
of the patched run over the corresponding unpatched run.)

Index on "bigint" column:
CREATE INDEX li_ordkey_idx1 ON lineitem(l_orderkey);

         Without patch                 With patch                    % Change
Test 1:  1062446.462 ms (17:42.446)    1024972.273 ms (17:04.972)     3.52 %
Test 2:  1053468.945 ms (17:33.469)     896375.543 ms (14:56.376)    17.75 %
Test 3:  1082920.703 ms (18:02.921)     932550.058 ms (15:32.550)    13.88 %

Index on "integer" column:
CREATE INDEX li_lineno_idx2 ON lineitem(l_linenumber);

         Without patch                 With patch                    % Change
Test 1:  1538285.499 ms (25:38.285)    1201008.423 ms (20:01.008)    21.92 %
Test 2:  1529837.023 ms (25:29.837)    1014188.489 ms (16:54.188)    33.70 %
Test 3:  1642160.947 ms (27:22.161)     978518.253 ms (16:18.518)    40.41 %

Index on "numeric" column:
CREATE INDEX li_qty_idx3 ON lineitem(l_quantity);

         Without patch                    With patch                 % Change
Test 1:  3968102.568 ms (01:06:08.103)    2359304.405 ms (39:19.304)  40.54 %
Test 2:  4129510.930 ms (01:08:49.511)    1680201.644 ms (28:00.202)  59.31 %
Test 3:  4348248.210 ms (01:12:28.248)    1490461.879 ms (24:50.462)  65.72 %

Index on "character" column:
CREATE INDEX li_lnst_idx4 ON lineitem(l_linestatus);

         Without patch                 With patch                    % Change
Test 1:  1510273.931 ms (25:10.274)    1240265.301 ms (20:40.265)    17.87 %
Test 2:  1516842.985 ms (25:16.843)     995730.092 ms (16:35.730)    34.35 %
Test 3:  1580789.375 ms (26:20.789)     984975.746 ms (16:24.976)    37.69 %

Index on "date" column:
CREATE INDEX li_shipdt_idx5 ON lineitem(l_shipdate);

         Without patch                 With patch                    % Change
Test 1:  1483603.274 ms (24:43.603)    1189704.930 ms (19:49.705)    19.80 %
Test 2:  1498348.925 ms (24:58.349)    1040421.626 ms (17:20.422)    30.56 %
Test 3:  1653651.499 ms (27:33.651)    1016305.794 ms (16:56.306)    38.54 %

Index on "character varying" column:
CREATE INDEX li_comment_idx6 ON lineitem(l_comment);

         Without patch                    With patch                    % Change
Test 1:  6945953.838 ms (01:55:45.954)    4329696.334 ms (01:12:09.696)  37.66 %
Test 2:  6818556.437 ms (01:53:38.556)    2834034.054 ms (47:14.034)     58.43 %
Test 3:  6942285.711 ms (01:55:42.286)    2648430.902 ms (44:08.431)     61.85 %

Composite index on "numeric", "character" columns:
CREATE INDEX li_qtylnst_idx34 ON lineitem(l_quantity, l_linestatus);

         Without patch                    With patch                    % Change
Test 1:  4961563.400 ms (01:22:41.563)    2959722.178 ms (49:19.722)     40.34 %
Test 2:  5242809.501 ms (01:27:22.810)    2077463.136 ms (34:37.463)     60.37 %
Test 3:  5576765.727 ms (01:32:56.766)    1755829.420 ms (29:15.829)     68.51 %

Composite index on "date", "character varying" columns:
CREATE INDEX li_shipdtcomment_idx56 ON lineitem(l_shipdate, l_comment);

         Without patch                    With patch                    % Change
Test 1:  4693318.077 ms (01:18:13.318)    3181494.454 ms (53:01.494)     32.21 %
Test 2:  4627624.682 ms (01:17:07.625)    2613289.211 ms (43:33.289)     43.52 %
Test 3:  4719242.965 ms (01:18:39.243)    2685516.832 ms (44:45.517)     43.09 %

Thanks & Regards,

Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984

www.enterprisedb.com

On Tue, Jan 16, 2018 at 6:24 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> On Fri, Jan 12, 2018 at 10:28 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
> > More comments:
>
> Attached patch has all open issues worked through, including those
> that I respond to or comment on below, as well as the other feedback
> from your previous e-mails. Note also that I fixed the issue that Amit
> raised, as well as the GetOldestXmin()-argument bug that I noticed in
> passing when responding to Amit. I also worked on the attribution in
> the commit message.
>
> Before getting to my responses to your most recent round of feedback,
> I want to first talk about some refactoring that I decided to do. As
> you can see from the master branch, tuplesort_performsort() isn't
> necessarily reached for spool2, even when we start out with a spool2
> (that is, for many unique index builds, spool2 never even does a
> tuplesort_performsort()). We may instead decide to shut down spool2
> when it has no (dead) tuples. I made this work just as well for the
> parallel case in this latest revision. I had to teach tuplesort.c to
> accept an early tuplesort_end() for LEADER() -- it had to be prepared
> to release still-waiting workers in some cases, rather than depending
> on nbtsort.c having called tuplesort_performsort() already. Several
> routines within nbtsort.c that previously knew something about
> parallelism now know nothing about it. This seems like a nice win.
>
> Separately, I took advantage of the fact that within the leader, its
> *worker* Tuplesortstate can safely call tuplesort_end() before the
> leader state's tuplesort_performsort() call.
>
> The overall effect of these two changes is that there is now a
> _bt_leader_heapscan() call for the parallel case that nicely mirrors
> the serial case's IndexBuildHeapScan() call, and once we're done with
> populating spools, no subsequent code needs to know a single thing
> about parallelism as a special case. You may notice some small changes
> to the tuplesort.h overview, which now advertises that callers can
> take advantage of this leeway.
>
> Now on to my responses to your most recent round of feedback...
>
> > BufFileView() looks fairly pointless. It basically creates a copy of
> > the input and, in so doing, destroys the input, which is a lot like
> > returning the input parameter except that it uses more cycles. It
> > does do a few things.
>
> While it certainly did occur to me that that was kind of weird, and I
> struggled with it on my own for a little while, I ultimately agreed
> with Thomas that it added something to have ltsConcatWorkerTapes()
> call some buffile function in every iteration of its loop.
> (BufFileView() + BufFileViewAppend() are code that Thomas actually
> wrote, though I added the asserts and comments myself.)
>
> If you think about this in terms of the interface rather than the
> implementation, then it may make more sense. The encapsulation adds
> something which might pay off later, such as when extendBufFile()
> needs to work with a concatenated set of BufFiles. And even right now,
> I cannot simply reuse the BufFile without then losing the assert that
> is currently in BufFileViewAppend() (must not have associated shared
> fileset assert). So I'd end up asserting less (rather than more) there
> if BufFileView() was removed.
>
> It wastes some cycles to not simply use the BufFile directly, but not
> terribly many in the grand scheme of things. This happens once per
> external sort operation.
>
> > In miscadmin.h, I'd put the prototype for the new GUC next to
> > max_worker_processes, not maintenance_work_mem.
>
> But then I'd really have to put it next to max_worker_processes in
> globals.c, too. That would mean that it would go under "Primary
> determinants of sizes of shared-memory structures" within globals.c,
> which seems wrong to me. What do you think?
>
> > The ereport() in index_build will, I think, confuse people when it
> > says that there are 0 parallel workers. I suggest splitting this into
> > two cases: if (indexInfo->ii_ParallelWorkers == 0) ereport(...
> > "building index \"%s\" on table \"%s\" serially" ...) else ereport(...
> > "building index \"%s\" on table \"%s\" in parallel with request for %d
> > parallel workers" ...).
>
> WFM. I've simply dropped any reference to leader participation in the
> messages here, to keep things simple. This seemed okay because the
> only thing that affects leader participation is the
> parallel_leader_participation GUC, which is under the user's direct
> control at all times, and is unlikely to be changed. Those that really
> want further detail have trace_sort for that.
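>
> A quick sketch of how to get at that detail, for those that do want it
> (this assumes a build with TRACE_SORT defined, which is the default;
> the index name here is arbitrary):
>
> SET trace_sort = on;
> -- trace_sort reports arrive as LOG messages; have the server send
> -- them to the client as well:
> SET client_min_messages = log;
> CREATE INDEX idx ON tab(col);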
>
> > The logic in IndexBuildHeapRangeScan() around need_register_snapshot
> > and OldestXmin seems convoluted and not very well-edited to me.
>
> Having revisited it, I now agree that the code added to
> IndexBuildHeapRangeScan() was unclear, primarily in that the
> need_unregister_snapshot local variable was overloaded in a weird way.
>
> > I suggest that you go back to the original code organization
> > and then just insert an additional case for a caller-supplied scan, so
> > that the overall flow looks like this:
> >
> > if (scan != NULL)
> > {
> > ...
> > }
> > else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
> > {
> > ...
> > }
> > else
> > {
> > ...
> > }
>
> The problem that I see with this alternative flow is that the "if
> (scan != NULL)" and the "else if (IsBootstrapProcessingMode() ||
> indexInfo->ii_Concurrent)" blocks clearly must contain code for two
> distinct, non-overlapping cases, despite the fact those two cases
> actually do overlap somewhat. That is, a call to
> IndexBuildHeapRangeScan() may have a (parallel) heap scan argument
> (control reaches your first code block), or may not (control reaches
> your second or third code block). At the same time, a call to
> IndexBuildHeapRangeScan() may use SnapshotAny (ordinary CREATE INDEX),
> or may need an MVCC snapshot (either by registering its own, or using
> the parallel one). These two things are orthogonal.
>
> I think I still get the gist of what you're saying, though. I've come
> up with a new structure that is a noticeable improvement on what I
> had. Importantly, the new structure let me add a number of
> parallelism-agnostic asserts that make sure that every ambuild routine
> that supports parallelism gets the details right.
>
> > Along with that, I'd change the name of need_register_snapshot to
> > need_unregister_snapshot (it's doing both jobs right now) and
> > initialize it to false.
>
> Done.
>
> > + * This support code isn't reliable when called from within a parallel
> > + * worker process due to the fact that our state isn't propagated. This is
> > + * why parallel index builds are disallowed on catalogs. It is possible
> > + * that we'll fail to catch an attempted use of a user index undergoing
> > + * reindexing due the non-propagation of this state to workers, which is not
> > + * ideal, but the problem is not particularly likely to go undetected due to
> > + * our not doing better there.
> >
> > I understand the first two sentences, but I have no idea what the
> > third one means, especially the part that says "not particularly
> > likely to go undetected due to our not doing better there". It sounds
> > scary that something bad is only "not particularly likely to go
> > undetected"; don't we need to detect bad things reliably?
>
> The primary point here, that you said you understood, is that we
> definitely need to detect when we're reindexing a catalog index within
> the backend, so that systable_beginscan() can do the right thing and
> not use the index (we also must avoid assertion failures). My solution
> to that problem is, of course, to not allow the use of parallel create
> index when REINDEXing a system catalog. That seems 100% fine to me.
>
> There is a little bit of ambiguity about other cases, though -- that's
> the secondary point I tried to make within that comment block, and the
> part that you took issue with. To put this secondary point another
> way: It's possible that we'd fail to detect it if someone's comparator
> went bananas and decided it was okay to do SQL access (that resulted
> in an index scan of the index undergoing reindex). That does seem
> rather unlikely, but I felt it necessary to say something like this
> because ReindexIsProcessingIndex() isn't already something that only
> deals with catalog indexes -- it works with all indexes.
>
> Anyway, I reworded this. I hope that what I came up with is clearer than
> before.
>
> > But also,
> > you used the word "not" three times and also the prefix "un-", meaning
> > "not", once. Four negations in 13 words! Perhaps I'm not entirely in
> > a position to cast aspersions on overly-complex phraseology -- the pot
> > calling the kettle black and all that -- but I bet that will be a lot
> > clearer if you reduce the number of negations to either 0 or 1.
>
> You're not wrong. Simplified.
>
> > The comment change in standard_planner() doesn't look helpful to me;
> > I'd leave it out.
>
> Okay.
>
> > + * tableOid is the table that index is to be built on. indexOid is the OID
> > + * of a index to be created or reindexed (which must be a btree index).
> >
> > I'd rewrite that first sentence to end "the table on which the index
> > is to be built". The second sentence should say "an index" rather
> > than "a index".
>
> Okay.
>
> > But, actually, I think we would be better off just ripping
> > leaderWorker/leaderParticipates out of this function altogether.
> > compute_parallel_worker() is not really under any illusion that it's
> > computing a number of *participants*; it's just computing a number of
> > *workers*.
>
> That distinction does seem to cause plenty of confusion. While I
> accept what you say about compute_parallel_worker(), I still haven't
> gone as far as removing the leaderParticipates argument altogether,
> because compute_parallel_worker() isn't the only thing that matters
> here. (More on that below.)
>
> > I think it's fine for
> > parallel_leader_participation=off to simply mean that you get one
> > fewer participants. That's actually what would happen with parallel
> > query, too. Parallel query would consider
> > parallel_leader_participation later, in get_parallel_divisor(), when
> > working out the cost of one path vs. another, but it doesn't use it to
> > choose the number of workers. So it seems to me that getting rid of
> > all of the workerLeader considerations will make it both simpler and
> > more consistent with what we do for queries.
>
> I was aware of those details, and figured that parallel query fudges
> the compute_parallel_worker() figure's leader participation in some
> sense, and that that was what I needed to compensate for. After all,
> when parallel_leader_participation=off, having
> compute_parallel_worker() return 1 means rather a different thing to
> what it means with parallel_leader_participation=on, even though in
> general we seem to assume that parallel_leader_participation can only
> make a small difference overall.
>
> Here's what I've done based on your feedback: I've changed the header
> comments, but stopped leaderParticipates from affecting the
> compute_parallel_worker() calculation (so, as I said,
> leaderParticipates stays). The leaderParticipates argument continues
> to affect these two aspects of plan_create_index_workers()'s return
> value:
>
> 1. It continues to be used so we have a total number of participants
> (not workers) to apply our must-have-32MB-workMem limit on
> participants.
>
> Parallel query has no equivalent of this, and it seems warranted. Note
> that this limit is no longer applied when parallel_workers storage
> param was set, as discussed.
>
> 2. I continue to use the leaderParticipates argument to disallow the
> case where there is only one CREATE INDEX participant but parallelism
> is in use, because, of course, that clearly makes no sense -- we
> should just use a serial sort instead.
>
> (It might make sense to allow this if parallel_leader_participation
> was *purely* a testing GUC, only for use by backend hackers, but
> AFAICT it isn't.)
>
> The planner can allow a single participant parallel sequential scan
> path to be created without worrying about the fact that that doesn't
> make much sense, because a plan with only one parallel participant is
> always going to cost more than some serial plan (you will only get a 1
> participant parallel sequential scan when force_parallel_mode is on).
> Obviously plan_create_index_workers() doesn't generate (partial) paths
> at all, so I simply have to get the same outcome (avoiding a senseless
> 1 participant parallel operation) some other way here.
>
> > If you have an idea how to make a better
> > cost model than this for CREATE INDEX, I'm willing to consider other
> > options. If you don't, or want to propose that as a follow-up patch,
> > then I think it's OK to use what you've got here for starters. I just
> > don't want it to be more baroque than necessary.
>
> I suspect that the parameters of any cost model for parallel CREATE
> INDEX that we're prepared to consider for v11 are: "Use a number of
> parallel workers that is one below the number at which the total
> duration of the CREATE INDEX either stays the same or goes up".
>
> It's hard to do much better than this within those parameters. I can
> see a fairly noticeable benefit to parallelism with 4 parallel workers
> and a measly 1MB of maintenance_work_mem (when parallelism is forced)
> relative to the serial case with the same amount of memory. At least
> on my laptop, it seems to be rather hard to lose relative to a serial
> sort when using parallel CREATE INDEX (to be fair, I'm probably
> actually using way more memory than 1MB to do this due to FS cache
> usage). I can think of a cleverer approach to costing parallel CREATE
> INDEX, but it's only cleverer by weighing distributed costs. Not very
> relevant, for the time being.
>
> BTW, the 32MB per participant limit within plan_create_index_workers()
> was chosen based on the realization that any higher value would make
> having a default setting of 2 for max_parallel_maintenance_workers (to
> match the max_parallel_workers_per_gather default) pointless when the
> default maintenance_work_mem value of 64MB is in use. That's not
> terribly scientific, though it at least doesn't come at the expense of
> a more scientific idea for a limit like that (I don't actually have
> one, you see). I am merely trying to avoid being *gratuitously*
> wasteful of shared resources that are difficult to accurately cost in
> (e.g., the distributed cost of random I/O to the system as a whole
> when we do a parallel index build while ridiculously low on
> maintenance_work_mem).
>
> > I think that the naming of the wait events could be improved. Right
> > now, they are named by which kind of process does the waiting, but it
> > really should be named based on the thing for which we're
> > waiting. I also suggest that we could just write Sort instead of
> > Tuplesort. In short, I suggest ParallelTuplesortLeader ->
> > ParallelSortWorkersDone and ParallelTuplesortWorker ->
> > ParallelSortTapeHandover.
>
> WFM. Also added documentation for the wait events to monitoring.sgml,
> which I somehow missed the first time around.
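>
> For example, while a parallel build is underway, something along these
> lines run from a second session should show the new events (just a
> sketch, using the names suggested above):
>
> SELECT pid, wait_event_type, wait_event
> FROM pg_stat_activity
> WHERE wait_event IN ('ParallelSortWorkersDone',
>                      'ParallelSortTapeHandover');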
>
> > Not for this patch, but I wonder if it might be a worthwhile future
> > optimization to allow workers to return multiple tapes to the leader.
> > One doesn't want to go crazy with this, of course. If the worker
> > returns 100 tapes, then the leader might get stuck doing multiple
> > merge passes, which would be a foolish way to divide up the labor, and
> > even if that doesn't happen, Amdahl's law argues for minimizing the
> > amount of work that is not done in parallel. Still, what if a worker
> > (perhaps after merging) ends up with 2 or 3 tapes? Is it really worth
> > merging them so that the leader can do a 5-way merge instead of a
> > 15-way merge?
>
> I did think about this myself, or rather I thought specifically about
> building a serial/bigserial PK during pg_restore, a case that must be
> very common. The worker merges for such an index build will typically
> be *completely pointless* when all input runs are in sorted order,
> because the merge heap will only need to consult the root of the heap
> and its two immediate children throughout (commit 24598337c helped
> cases like this enormously). You might as well merge hundreds of runs
> in the leader, provided you still have enough memory per tape that you
> can get the full benefit of OS readahead (this is not that hard when
> you're only going to repeatedly read from the same tape anyway).
>
> I'm not too worried about it, though. The overall picture is still
> very positive even in this case. The "extra worker merging" isn't
> generally a big proportion of the overall cost, especially there. More
> importantly, if I tried to do better, it would be the "quicksort with
> spillover" cost model story all over again (remember how tedious that
> was?). How hard are we prepared to work to ensure that we get it right
> when it comes to skipping worker merging, given that users always pay
> some overhead, even when that doesn't happen?
>
> Note also that parallel index builds manage to unfairly *gain*
> advantage over serial cases (they have the good variety of dumb luck,
> rather than the bad variety) in certain other common cases. This
> happens with an *inverse* physical/logical correlation (e.g. a DESC
> index builds on a date field). They manage to artificially do better
> than theory would predict, simply because a greater number of smaller
> quicksorts are much faster during initial run generation, without also
> taking a concomitant performance hit at merge time. Thomas showed this
> at one point. Note that even that's only true because of the qsort
> precheck (what I like to call the "banana skin prone" precheck, that
> we added to our qsort implementation in 2006) -- it would be true for
> *all* correlations, but that one precheck thing complicates matters.
>
> All of this is a tricky business, and that isn't going to get any easier
> IMV.
>
> > + * Make sure that the temp file(s) underlying the tape set are created in
> > + * suitable temp tablespaces. This is only really needed for serial
> > + * sorts.
> >
> > This comment makes me wonder whether it is "sorta" needed for parallel
> > sorts.
>
> I removed "really". The point of the comment is that we've already set
> up temp tablespaces for the shared fileset in the parallel case.
> Shared filesets figure out which tablespaces will be used up-front --
> see SharedFileSetInit().
>
> > - if (trace_sort)
> > + if (trace_sort && !WORKER(state))
> >
> > I have a feeling we still want to get this output even from workers,
> > but maybe I'm missing something.
>
> I updated tuplesort_end() so that trace_sort reports on the end of the
> sort, even for worker processes. (We still don't show generic
> tuplesort_begin* message for workers, though.)
>
> > + arg5 indicates serial, parallel worker, or parallel leader sort.</entry>
> >
> > I think it should say what values are used for each case.
>
> I based this on "arg0 indicates heap, index or datum sort", where it's
> implied that the values are respective to the order that they appear
> in in the sentence (starting from 0). But okay, I'll do it that way
> all the same.
>
> > + /* Release worker tuplesorts within leader process as soon as possible */
> >
> > IIUC, the worker tuplesorts aren't really holding onto much of
> > anything in terms of resources. I think it might be better to phrase
> > this as /* The sort we just did absorbed the final tapes produced by
> > these tuplesorts, which are of no further use. */ or words to that
> > effect.
>
> Okay. Done that way.
>
> > Instead of making a special case in CreateParallelContext for
> > serializable_okay, maybe index_build should just use SetConfigOption()
> > to force the isolation level to READ COMMITTED right after it does
> > NewGUCNestLevel(). The change would only be temporary because the
> > subsequent call to AtEOXact_GUC() will revert it.
>
> I tried doing it that way, but it doesn't seem workable:
>
> postgres=# begin transaction isolation level serializable ;
> BEGIN
> postgres=*# reindex index test_unique;
> ERROR: 25001: SET TRANSACTION ISOLATION LEVEL must be called before any
> query
> LOCATION: call_string_check_hook, guc.c:9953
>
> Note that AutoVacLauncherMain() uses SetConfigOption() to set/modify
> default_transaction_isolation -- not transaction_isolation.
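>
> To illustrate the distinction (a sketch):
>
> -- OK at any point; only affects subsequently started transactions:
> SET default_transaction_isolation = 'read committed';
> -- Only accepted before the current transaction has executed a query,
> -- hence the error above:
> SET transaction_isolation = 'read committed';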
>
> Instead, I added a bit more to comments within
> CreateParallelContext(), to justify what I've done along the lines you
> went into. Hopefully this works better for you.
>
> > There is *still* more to review here, but my concentration is fading.
> > If you could post an updated patch after adjusting for the comments
> > above, I think that would be helpful. I'm not totally out of things
> > to review that I haven't already looked over once, but I think I'm
> > close.
>
> I'm impressed with how quickly you're getting through review of the
> patch. Hopefully we can keep that momentum up.
>
> Thanks
> --
> Peter Geoghegan
>
