Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: hlinnaka <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2016-10-26 16:02:21
Message-ID: CAM3SWZSsvig8eZtxCG1L9bZZw+-uWdJ48yxSM41QmNLtJeQSyA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Aug 1, 2016 at 3:18 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> Setup:
>
> CREATE TABLE parallel_sort_test AS
> SELECT hashint8(i) randint,
> md5(i::text) collate "C" padding1,
> md5(i::text || '2') collate "C" padding2
> FROM generate_series(0, 1e9::bigint) i;
>
> CHECKPOINT;
>
> This leaves us with a parallel_sort_test table that is 94 GB in size.
>
> SET maintenance_work_mem = '8GB';
>
> -- Serial case (external sort, should closely match master branch):
> CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH
> (parallel_workers = 0);
>
> Total time: 00:15:42.15
>
> -- Patch with 8 tuplesort "sort-and-scan" workers (leader process
> participates as a worker here):
> CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH
> (parallel_workers = 7);
>
> Total time: 00:06:03.86
>
> As you can see, the parallel case is 2.58x faster

I decided to revisit this exact benchmark, using the same AWS instance
type (the one with 16 HDDs, again configured in software RAID0) to see
how things had changed for both parallel and serial cases. I am now
testing V4. A lot changed in the last 3 months, with most of the
changes that help here now already committed to the master branch.

Relevant changes
================

* Heikki's major overhaul of preload memory makes CREATE INDEX merging
have more sequential access patterns. It also effectively allows us to
use more memory. It's possible that the biggest benefit it brings to
parallel CREATE INDEX is that is eliminates almost any random I/O
penalty from logtape.c fragmentation that an extra merge pass has;
parallel workers now usually do their own merge to produce one big run
for the leader to merge. It also improves CPU cache efficiency quite
directly, I think.

This is the patch that helps most. Many thanks to Heikki for driving
this forward.

* My patch to simplify and optimize how the K-way merge heap is
maintained (as tuples fill leaf pages of the final index structure)
makes the merge phase significantly less CPU bound overall.

(These first two items particularly help parallel CREATE INDEX, which
spends proportionally much more wall clock time merging than would be
expected for similar serial cases. Serial cases do of course also
benefit.)

* V2 of the patch (and all subsequent versions) apportioned slices of
maintenance_work_mem to workers. maintenance_work_mem became a
per-utility-operation budget, regardless of number of workers
launched. This means that workers have less memory than the original
V1 benchmark (they simply don't make use of it now), but this seems
unlikely to hurt. Possibly, it even helps.

* Andres' work on md.c scalability may have helped (seems unlikely
with these CREATE INDEX cases that produce indexes not in the hundreds
of gigabytes, though). It would help with *extremely* large index
creation, which we won't really look at here.

Things now look better than ever for the parallel CREATE INDEX patch.
While it's typical for about 75% of wall clock time to be spent on
sorting runs with serial CREATE INDEX, with the remaining 25% going on
merging/writing index, with parallel CREATE INDEX I now generally see
about a 50/50 split between parallel sorting of runs (including any
worker merging to produce final runs) and serial merging for final
on-the-fly merge where we actually write new index out as input is
merged. This is a *significant* improvement over what we saw here back
in August, where it was not uncommon for parallel CREATE INDEX to
spend *twice* as much time in the serial final on-the-fly merge step.

All improvements to the code that we've seen since August have
targeted this final on-the-fly merge bottleneck. (The final on-the-fly
merge is now *consistently* able to write out the index at a rate of
150MB/sec+ in my tests, which is pretty good.)

New results
==========

Same setup as one quoted above -- once again, we "SET
maintenance_work_mem = '8GB'".

-- Patch with 8 tuplesort "sort-and-scan" workers:
CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH
(parallel_workers = 7);

Total time: 00:04:24.93

-- Serial case:
CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH
(parallel_workers = 0);

Total time: 00:14:25.19

3.27x faster. Not bad. As you see in the quoted text, that was 2.58x
back in August, even though the implementation now uses a lot less
memory in parallel workers. And, that's without even considering the
general question of how much faster index creation can be compared to
Postgres 9.6 -- it's roughly 3.5x faster at times.

New case
========

Separately, using my gensort tool [1], I came up with a new test case.
The tool generated a 2.5 billion row table, sized at 159GB. This is
how long is takes to produce a 73GB index on the "sortkey" column of
the resulting table:

-- gensort "C" locale text parallel case:
CREATE INDEX test8 on sort_test(sortkey) WITH (parallel_workers = 7);

Total time: 00:16:19.63

-- gensort "C" locale text serial case:
CREATE INDEX test0 on sort_test(sortkey) WITH (parallel_workers = 0);

Total time: 00:45:56.96

That's a 2.81x improvement in creation time relative to a serial case.
Not quite as big a difference as seen in the first case, but remember
that this is just like cases that were only made something like 2x -
2.2x faster by the use of parallelism back in August (see full e-mail
quoted above [2]). These are cases involving a text column, or maybe a
numeric column, that have complex comparators used during merging that
must handle detoasting, possibly even allocate memory, etc. This
second result is therefore probably the more significant of the two
results shown, since it now seems like we're more consistently close
to the ~3x improvement that other major database systems also seem to
top out at as parallel CREATE INDEX workers are added. (I still can't
see any benefit with 16 workers; my guess is that the anti-scaling
begins even before the merge starts when using that many workers. That
guess is hard to verify, given the confounding factor of more workers
producing more runs, leaving more work for the serial merge phase.)

I'd still welcome benchmarking or performance validation from somebody else.

[1] https://github.com/petergeoghegan/gensort
[2] https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2016-10-26 16:04:55 what to do about unsupported encodings
Previous Message Euler Taveira 2016-10-26 15:51:51 Re: Fast Default WIP patch for discussion