Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-02-16 02:05:14
Message-ID: CAEepm=3A3srTeZUy_iQQyPpzV6S7M1dq+0rOZYgYr6JaQUN+hQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 4, 2017 at 2:45 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> It might just have been that the table was too small to be an
> effective target for parallel sequential scan with so many workers,
> and so a presorted best case CREATE INDEX, which isn't that different,
> also fails to see much benefit (compared to what you'd see with a
> similar case involving a larger table). In other words, I might have
> jumped the gun in emphasizing issues with hardware and I/O bandwidth
> over issues around data volume (that I/O parallelism is inherently not
> very helpful with these relatively small tables).
>
> As I've pointed out a couple of times before, bigger sorts will be
> more CPU bound because sorting itself has costs that grow
> linearithmically, whereas writing out runs has costs that grow
> linearly. The relative cost of the I/O can be expected to go down as
> input goes up for this reason. At the same time, a larger input might
> make better use of I/O parallelism, which reduces the cost paid in
> latency to write out runs in absolute terms.
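
To restate that back to myself in symbols, with c_1 and c_2 as assumed
constants for the per-comparison and per-byte-written costs:

    T_{sort}(n)  \approx c_1 \, n \log n
    T_{write}(n) \approx c_2 \, n
    T_{write}(n) / T_{sort}(n) \approx c_2 / (c_1 \log n)

so the run-writing share of the total cost should indeed shrink as
the input grows.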

Here are some results with your latest patch, using the same test as
before but this time with SCALE=100 (= 100,000,000 rows). The table
sizes are:

                          List of relations
 Schema |         Name         | Type  |    Owner     | Size  | Description
--------+----------------------+-------+--------------+-------+-------------
 public | million_words        | table | thomas.munro | 42 MB |
 public | some_words           | table | thomas.munro | 19 MB |
 public | test_intwide_u_asc   | table | thomas.munro | 18 GB |
 public | test_intwide_u_desc  | table | thomas.munro | 18 GB |
 public | test_intwide_u_rand  | table | thomas.munro | 18 GB |
 public | test_textwide_u_asc  | table | thomas.munro | 19 GB |
 public | test_textwide_u_desc | table | thomas.munro | 19 GB |
 public | test_textwide_u_rand | table | thomas.munro | 19 GB |

To reduce the number of combinations I used only unique data and
built only non-unique indexes, with only 'wide' tuples (= the key
plus a text column holding a 151-character string, rather than just
the key), and I also skipped the 1MB memory size, as suggested.
Here are the results up to 4 workers (a results table going up to 8
workers is attached, since it wouldn't format nicely if I pasted it
here). Again, the w = 0 column shows the time in seconds; the rest
show relative speed-up. This data was all in the OS page cache
because of a dummy run done first, and I verified with 'sar' that
there were exactly zero reads from the block device. The CPU was
pegged on leader + workers
during sort runs, and then the leader's CPU hovered around 93-98%
during the merge/btree build. I had some technical problems getting a
cold-cache read-from-actual-disk-each-time test run to work properly,
but can go back and do that again if anyone thinks that would be
interesting data to see.
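
For concreteness, each timed case looked roughly like this (a sketch
only: the row generator below is made up, and the worker-count
setting is illustrative rather than the patch's actual knob):

  -- hypothetical generator for a "textwide"/"rand" style table:
  -- a random text key plus a 151-character padding column, 100M rows
  CREATE TABLE test_textwide_u_rand AS
  SELECT md5(random()::text) AS key,
         repeat('x', 151)    AS padding
  FROM generate_series(1, 100000000);

  SET maintenance_work_mem = '64MB';         -- also 256MB and 512MB
  SET max_parallel_maintenance_workers = 4;  -- illustrative; w = 0..8

  \timing on
  CREATE INDEX ON test_textwide_u_rand (key);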

   tab    | ord  | mem |  w = 0  | w = 1 | w = 2 | w = 3 | w = 4
----------+------+-----+---------+-------+-------+-------+-------
 intwide  | asc  |  64 |   67.91 | 1.26x | 1.46x | 1.62x | 1.73x
 intwide  | asc  | 256 |   67.84 | 1.23x | 1.48x | 1.63x | 1.79x
 intwide  | asc  | 512 |   69.01 | 1.25x | 1.50x | 1.63x | 1.80x
 intwide  | desc |  64 |   98.08 | 1.48x | 1.83x | 2.03x | 2.25x
 intwide  | desc | 256 |   99.87 | 1.43x | 1.80x | 2.03x | 2.29x
 intwide  | desc | 512 |  104.09 | 1.44x | 1.85x | 2.09x | 2.33x
 intwide  | rand |  64 |  138.03 | 1.56x | 2.04x | 2.42x | 2.58x
 intwide  | rand | 256 |  139.44 | 1.61x | 2.04x | 2.38x | 2.56x
 intwide  | rand | 512 |  138.96 | 1.52x | 2.03x | 2.28x | 2.57x
 textwide | asc  |  64 |  207.10 | 1.20x | 1.07x | 1.09x | 1.11x
 textwide | asc  | 256 |  200.62 | 1.19x | 1.06x | 1.04x | 0.99x
 textwide | asc  | 512 |  191.42 | 1.16x | 0.97x | 1.01x | 0.94x
 textwide | desc |  64 | 1382.48 | 1.89x | 2.37x | 3.18x | 3.87x
 textwide | desc | 256 | 1427.99 | 1.89x | 2.42x | 3.24x | 4.00x
 textwide | desc | 512 | 1453.21 | 1.86x | 2.39x | 3.23x | 3.75x
 textwide | rand |  64 | 1587.28 | 1.89x | 2.37x | 2.66x | 2.75x
 textwide | rand | 256 | 1557.90 | 1.85x | 2.34x | 2.64x | 2.73x
 textwide | rand | 512 | 1547.97 | 1.87x | 2.32x | 2.64x | 2.71x

"textwide" "asc" is nearly an order of magnitude faster than other
initial orders without parallelism, but then parallelism doesn't seem
to help it much. Also, using more than 64MB never seems to help
very much; in the "desc" case it actually hurts.

I was curious to understand how performance changes if we become just
a bit less correlated (rather than completely uncorrelated or
perfectly inversely correlated), so I tried out a 'banana skin' case:
I took the contents of the textwide asc table and copied it to a new
table, and then moved the 900 words matching 'banana%' to the physical
end of the heap by deleting and reinserting them in one transaction.
I guess if we were to use this technology for CLUSTER, this might be
representative of a situation where you regularly recluster a growing
table. The results were pretty much like "asc":

   tab    |  ord   | mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4
----------+--------+-----+--------+-------+-------+-------+-------
 textwide | banana |  64 | 213.39 | 1.17x | 1.11x | 1.15x | 1.09x
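
The delete-and-reinsert was one transaction along these lines (a
sketch; the copied table's name and the key column name are made
up):

  BEGIN;
  -- pull out the ~900 'banana%' rows and re-insert them, so they
  -- land at the physical end of the heap
  WITH moved AS (
      DELETE FROM test_textwide_u_banana
      WHERE key LIKE 'banana%'
      RETURNING *
  )
  INSERT INTO test_textwide_u_banana SELECT * FROM moved;
  COMMIT;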

It's hard to do more than speculate here, but I guess that a
significant number of indexes in real-world databases might be
uncorrelated with insert order. A newly imported or insert-only
table might have one highly correlated index for a surrogate primary
key or time column, but other indexes might tend to be uncorrelated.
But really, who knows... in a textbook perfectly correlated case such
as a time series table with an append-only time or sequence-based
key, you might want to use BRIN rather than B-Tree anyway.
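
(For what it's worth, the BRIN alternative for that kind of
append-only key is just something like this; the table and column
names are made up:)

  CREATE TABLE measurements (ts timestamptz NOT NULL, payload text);
  CREATE INDEX measurements_ts_brin ON measurements USING brin (ts);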

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
speedup-100.txt text/plain 1.9 KB
