Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2016-10-08 00:47:13
Message-ID: CAM3SWZTmkOFEiCDpUNaO4n9-1xcmWP-1NXmT7h0Pu3gM2YuHvg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Sun, Sep 11, 2016 at 11:05 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> So, while there are still a few loose ends with this revision (it
> should still certainly be considered WIP), I wanted to get a revision
> out quickly because V1 has been left to bitrot for too long now, and
> my schedule is very full for the next week, ahead of my leaving to go
> on vacation (which is long overdue). Hopefully, I'll be able to get
> out a third revision next Saturday, on top of the
> by-then-presumably-committed new tape batch memory patch from Heikki,
> just before I leave. I'd rather leave with a patch available that can
> be cleanly applied, to make review as easy as possible, since it
> wouldn't be great to have this V2 with bitrot for 10 days or more.

Heikki committed his preload memory patch a little later than
originally expected, 4 days ago. I attach V3 of my own parallel CREATE
INDEX patch, which should be applied on top of a today's git master
(there is a bugfix that reviewers won't want to miss -- commit
b56fb691). I have my own test suite, and have to some extent used TDD
for this patch, so rebasing was not so bad . My tests are rather rough
and ready, so I'm not going to post them here. (Changes in the
WaitLatch() API also caused bitrot, which is now fixed.)

Changes from V2:

* Since Heikki eliminated the need for any extra memtuple "slots"
(memtuples is now only exactly big enough for the initial merge heap),
an awful lot of code could be thrown out that managed sizing memtuples
in the context of the parallel leader (based on trends seen in
parallel workers). I was able to follow Heikki's example by
eliminating code for parallel sorting memtuples sizing. Throwing this
code out let me streamline a lot of stuff within tuplesort.c, which is
cleaned up quite a bit.

* Since this revision was mostly focused on fixing up logtape.c
(rebasing on top of Heikki's work), I also took the time to clarify
some things about how an block-based offset might need to be applied
within the leader. Essentially, outlining how and where that happens,
and where it doesn't and shouldn't happen. (An offset must sometimes
be applied to compensate for difference in logical BufFile positioning
(leader/worker differences) following leader's unification of worker
tapesets into one big tapset of its own.)

* max_parallel_workers_maintenance now supersedes the use of the new
parallel_workers index storage parameter. This matches existing heap
storage parameter behavior, and allows the implementation to add
practically no cycles as compared to master branch when the use of
parallelism is disabled by setting max_parallel_workers_maintenance to
0.

* New additions to the chapter in the documentation that Robert added
a little while back, "Chapter 15. Parallel Query". It's perhaps a bit
of a stretch to call this feature part of parallel query, but I think
that it works reasonably well. The optimizer does determine the number
of workers needed here, so while it doesn't formally produce a query
plan, I think the implication that it does is acceptable for
user-facing documentation. (Actually, it would be nice if you really
could EXPLAIN utility commands -- that would be a handy place to show
information about how they were executed.)

Maybe this new documentation describes things in what some would
consider to be excessive detail for users. The relatively detailed
information added on parallel sorting seemed to be in the pragmatic
spirit of the new chapter 15, so I thought I'd see what people
thought.

Work is still needed on:

* Cost model. Should probably attempt to guess final index size, and
derive calculation of number of workers from that. Also, I'm concerned
that I haven't given enough thought to the low end, where with default
settings most CREATE INDEX statements will use at least one parallel
worker.

* The whole way that I teach nbtsort.c to disallow catalog tables for
parallel CREATE INDEX due to concerns about parallel safety is in need
of expert review, preferably from Robert. It's complicated in a way
that relies on things happening or not happening from a distance.

* Heikki seems to want to change more about logtape.c, and its use of
indirection blocks. That may evolve, but for now I can only target the
master branch.

* More extensive performance testing. I think that this V3 is probably
the fastest version yet, what with Heikki's improvements, but I
haven't really verified that.

--
Peter Geoghegan

Attachment Content-Type Size
0001-Cap-the-number-of-tapes-used-by-external-sorts.patch.gz application/x-gzip 1.8 KB
0002-Add-parallel-B-tree-index-build-sorting.patch.gz application/x-gzip 54.8 KB
0003-Add-force_btree_randomaccess-GUC-for-testing.patch.gz application/x-gzip 1.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2016-10-08 05:47:38 Re: Speed up Clog Access by increasing CLOG buffers
Previous Message Tom Lane 2016-10-07 23:58:58 Re: [COMMITTERS] pgsql: Remove -Wl,-undefined,dynamic_lookup in macOS build.