Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-02-03 13:04:55
Message-ID: CAEepm=1_LTOeGGJr2BquEwUJmBnT6PtEQx5GmD=i42GvVS3bcQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Wed, Feb 1, 2017 at 8:46 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Tue, Jan 31, 2017 at 11:23 PM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> So I'm in favour of this patch, which is relatively simple and give us
>> faster index builds soon. Eventually we might also be able to have
>> approach 1. From what I gather, it's entirely possible that we might
>> still need 2 to fall back on in some cases.
>
> Right. And it can form the basis of an implementation of 1, which in
> any case seems to be much more compelling for parallel query, when a
> great deal more can be pushed down, and we are not particularly likely
> to be I/O bound (usually not much writing to the heap, or WAL
> logging).

I ran some tests today. First I created test tables representing the
permutations of these choices:

Table structure:

int = Integer key only
intwide = Integer key + wide row
text = Text key only (using dictionary words)
textwide = Text key + wide row

Uniqueness:

u = each value unique
d = 10 duplicates of each value

Heap physical order:

rand = Random
asc = Ascending order (already sorted)
desc = Descending order (sorted backwards)

I used 10 million rows for this test run, so that gave me 24 tables of
the following sizes as reported in "\d+":

int tables = 346MB each
intwide tables = 1817MB each
text tables = 441MB each
textwide tables = 1953MB each

It'd be interesting to test larger tables of course but I had a lot of
permutations to get through.

For each of those tables I ran tests corresponding to the permutations
of these three variables:

Index type:

uniq = CREATE UNIQUE INDEX ("u" tables only, ie no duplicates)
nonu = CREATE INDEX ("u" and "d" tables)

Maintenance memory: 1M, 64MB, 256MB, 512MB

Workers: from 0 up to 8

Environment: EDB test machine "cthulhu", Intel(R) Xeon(R) CPU E7-
8830 @ 2.13GHz, 8 socket, 8 cores (16 threads) per socket, CentOS
7.2, Linux kernel 3.10.0-229.7.2.el7.x86_64, 512GB RAM, pgdata on SSD.
Database initialised with en_US.utf-8 collation, all defaults except
max_wal_size increased to 4GB (otherwise warnings about too frequent
checkpoints) and max_parallel_workers_maintenance = 8. Testing done
with warm OS cache.

I applied your v2 patch on top of
7ac4a389a7dbddaa8b19deb228f0a988e79c5795^ to avoid a conflict. It
still had a couple of harmless conflicts that I was able to deal with
(not code, just some header stuff moving around).

See full results from all permutations attached, but I wanted to
highlight the measurements from 'textwide', 'u', 'nonu' which show
interesting 'asc' numbers (data already sorted). The 'mem' column is
maintenance_work_mem in megabytes. The 'w = 0' column shows the time
in seconds for parallel_workers = 0. The other 'w = N' columns show
times with higher parallel_workers settings, represented as speed-up
relative to the 'w = 0' time.

1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
show speed-up relative to that time):

mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
64 | 19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
256 | 18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
512 | 17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x

2. 'rand' = randomised data:

mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
1 | 130.25 | 1.82x | 2.19x | 2.52x | 2.58x | 2.72x | 2.72x | 2.83x | 2.89x
64 | 117.36 | 1.80x | 2.20x | 2.43x | 2.47x | 2.55x | 2.51x | 2.59x | 2.69x
256 | 124.68 | 1.87x | 2.20x | 2.49x | 2.52x | 2.64x | 2.70x | 2.72x | 2.75x
512 | 115.77 | 1.51x | 1.72x | 2.14x | 2.08x | 2.19x | 2.31x | 2.44x | 2.48x

3. 'desc' = reverse-sorted data:

mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
1 | 115.19 | 1.88x | 2.39x | 2.78x | 3.50x | 3.62x | 4.20x | 4.19x | 4.39x
64 | 112.17 | 1.85x | 2.25x | 2.99x | 3.63x | 3.65x | 4.01x | 4.31x | 4.62x
256 | 119.55 | 1.76x | 2.21x | 2.85x | 3.43x | 3.37x | 3.77x | 4.24x | 4.28x
512 | 119.50 | 1.85x | 2.19x | 2.87x | 3.26x | 3.28x | 3.74x | 4.24x | 3.93x

The 'asc' effects are much less pronounced when the key is an int.
Here is the equivalent data for 'intwide', 'u', 'nonu':

1. 'asc'

mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
1 | 12.19 | 1.55x | 1.93x | 2.21x | 2.44x | 2.64x | 2.76x | 2.91x | 2.83x
64 | 7.35 | 1.29x | 1.53x | 1.69x | 1.86x | 1.98x | 2.04x | 2.07x | 2.09x
256 | 7.34 | 1.26x | 1.47x | 1.64x | 1.79x | 1.92x | 1.96x | 1.98x | 2.02x
512 | 7.24 | 1.24x | 1.46x | 1.65x | 1.80x | 1.91x | 1.97x | 2.00x | 1.92x

2. 'rand'

mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
1 | 15.16 | 1.56x | 2.01x | 2.32x | 2.57x | 2.73x | 2.87x | 2.95x | 2.91x
64 | 12.97 | 1.55x | 1.97x | 2.25x | 2.44x | 2.58x | 2.70x | 2.74x | 2.71x
256 | 13.14 | 1.47x | 1.86x | 2.12x | 2.31x | 2.50x | 2.62x | 2.58x | 2.69x
512 | 13.61 | 1.48x | 1.91x | 2.22x | 2.37x | 2.55x | 2.65x | 2.73x | 2.73x

3. 'desc'

mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
1 | 13.45 | 1.51x | 1.94x | 2.31x | 2.56x | 2.75x | 2.95x | 3.05x | 3.00x
64 | 10.27 | 1.42x | 1.82x | 2.05x | 2.30x | 2.46x | 2.59x | 2.64x | 2.65x
256 | 10.52 | 1.39x | 1.70x | 2.02x | 2.24x | 2.34x | 2.39x | 2.48x | 2.56x
512 | 10.62 | 1.43x | 1.82x | 2.06x | 2.32x | 2.51x | 2.61x | 2.68x | 2.69x

Full result summary and scripts used for testing attached.

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
speedup.txt text/plain 15.6 KB
make_test_data.sh application/x-sh 2.6 KB
run_tests.sh application/x-sh 1.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2017-02-03 13:24:33 Re: \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)
Previous Message Heikki Linnakangas 2017-02-03 12:52:52 Re: Password identifiers, protocol aging and SCRAM protocol