Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2016-11-15 15:44:06
Message-ID: CA+TgmoYiGH72XKQH_ayTQn_n=sMYGuYHZ1-67MyzX0n-ZHQp6Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 9, 2016 at 10:18 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> Maybe another way of putting this is that, while there's clearly a
>> benefit to having some kind of a cap, it's appropriate to pick a large
>> value, such as 500. Having no cap at all risks creating many extra
>> tapes that just waste memory, and also risks an unduly
>> cache-inefficient final merge. Reining that in makes sense.
>> However, we can't rein it in too far or we'll create slow polyphase
>> merges in cases that are reasonably likely to occur in real life.
>
> I completely agree with your analysis.

Cool. BTW, my run with 10 tapes completed in 10696528.377 ms
(02:58:16.528) - i.e. almost 3 minutes slower than with no tape limit.
Building runs took 4260.16 s, and the final merge pass began at
8239.12 s. That's certainly better than I expected, and it seems to
show that even if the number of tapes is grossly inadequate for the
number of runs, you can still make up most of the time that you lose
to I/O with improved cache efficiency -- at least under favorable
circumstances. Of course, on many systems I/O bandwidth will be a
scarce resource, so that argument can be overdone -- and even if not,
a 10-tape sort takes FAR longer to deliver its first tuple: run
building finished around 4260 s, but the final merge couldn't begin
streaming tuples until 8239 s, with intermediate polyphase merge
passes filling the gap.

I also tried this out with work_mem = 512MB. Doubling work_mem
reduces the number of runs enough that we don't get a polyphase merge
in any case. With no limit on tapes:

2016-11-10 11:24:45 UTC [54042] LOG: switching to external sort with
1873 tapes: CPU: user: 11.34 s, system: 0.48 s, elapsed: 12.13 s
2016-11-10 12:36:22 UTC [54042] LOG: finished writing run 308 to tape
307: CPU: user: 4096.63 s, system: 156.88 s, elapsed: 4309.66 s
2016-11-10 12:36:22 UTC [54042] LOG: using 516563 KB of memory for
read buffers among 308 input tapes
2016-11-10 12:36:30 UTC [54042] LOG: performsort done (except 308-way
final merge): CPU: user: 4097.75 s, system: 157.24 s, elapsed: 4317.76
s
2016-11-10 13:54:07 UTC [54042] LOG: external sort ended, 6255577
disk blocks used: CPU: user: 8638.72 s, system: 177.42 s, elapsed:
8974.44 s
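
(That 1873, incidentally, is just what tuplesort_merge_order()
computes from the memory budget: if I have the constants right, each
merge input costs about 280 KB -- a 256 KB preread buffer plus 24 KB
of tape buffer overhead -- so 524288 KB / 280 KB gives ~1872 inputs,
plus one output tape. The same arithmetic accounts for the 7490 and
29960 tape counts in the 2GB and 8GB runs below.)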

With max_sort_tapes = 501:

2016-11-10 14:23:50 UTC [54042] LOG: switching to external sort with
502 tapes: CPU: user: 10.99 s, system: 0.54 s, elapsed: 11.57 s
2016-11-10 15:36:47 UTC [54042] LOG: finished writing run 278 to tape
277: CPU: user: 4190.31 s, system: 155.33 s, elapsed: 4388.86 s
2016-11-10 15:36:47 UTC [54042] LOG: using 517313 KB of memory for
read buffers among 278 input tapes
2016-11-10 15:36:54 UTC [54042] LOG: performsort done (except 278-way
final merge): CPU: user: 4191.36 s, system: 155.68 s, elapsed: 4395.66
s
2016-11-10 16:53:39 UTC [54042] LOG: external sort ended, 6255699
disk blocks used: CPU: user: 8673.07 s, system: 175.93 s, elapsed:
9000.80 s

0.3% slower with the tape limit, but that might be noise. Even if
not, it seems pretty silly to create 1873 tapes when we only need
~300.

At work_mem = 2GB:

2016-11-10 18:08:00 UTC [54042] LOG: switching to external sort with
7490 tapes: CPU: user: 44.28 s, system: 1.99 s, elapsed: 46.33 s
2016-11-10 19:23:06 UTC [54042] LOG: finished writing run 77 to tape
76: CPU: user: 4342.10 s, system: 156.21 s, elapsed: 4551.95 s
2016-11-10 19:23:06 UTC [54042] LOG: using 2095202 KB of memory for
read buffers among 77 input tapes
2016-11-10 19:23:12 UTC [54042] LOG: performsort done (except 77-way
final merge): CPU: user: 4343.36 s, system: 157.07 s, elapsed: 4558.79
s
2016-11-10 20:24:24 UTC [54042] LOG: external sort ended, 6255946
disk blocks used: CPU: user: 7894.71 s, system: 176.36 s, elapsed:
8230.13 s

At work_mem = 2GB, max_sort_tapes = 501:

2016-11-10 21:28:23 UTC [54042] LOG: switching to external sort with
502 tapes: CPU: user: 44.09 s, system: 1.94 s, elapsed: 46.07 s
2016-11-10 22:42:28 UTC [54042] LOG: finished writing run 68 to tape
67: CPU: user: 4278.49 s, system: 154.39 s, elapsed: 4490.25 s
2016-11-10 22:42:28 UTC [54042] LOG: using 2095427 KB of memory for
read buffers among 68 input tapes
2016-11-10 22:42:34 UTC [54042] LOG: performsort done (except 68-way
final merge): CPU: user: 4279.60 s, system: 155.21 s, elapsed: 4496.83
s
2016-11-10 23:42:10 UTC [54042] LOG: external sort ended, 6255983
disk blocks used: CPU: user: 7733.98 s, system: 173.85 s, elapsed:
8072.55 s

Roughly 2% faster. Maybe still noise, but less likely. 7490 tapes
certainly seems over the top.

At work_mem = 8GB:

2016-11-14 19:17:28 UTC [54042] LOG: switching to external sort with
29960 tapes: CPU: user: 183.80 s, system: 7.71 s, elapsed: 191.61 s
2016-11-14 20:32:02 UTC [54042] LOG: finished writing run 20 to tape
19: CPU: user: 4431.44 s, system: 176.82 s, elapsed: 4665.16 s
2016-11-14 20:32:02 UTC [54042] LOG: using 8388083 KB of memory for
read buffers among 20 input tapes
2016-11-14 20:32:26 UTC [54042] LOG: performsort done (except 20-way
final merge): CPU: user: 4432.99 s, system: 181.29 s, elapsed: 4689.52
s
2016-11-14 21:30:56 UTC [54042] LOG: external sort ended, 6256003
disk blocks used: CPU: user: 7835.83 s, system: 199.01 s, elapsed:
8199.29 s

At work_mem = 8GB, max_sort_tapes = 501:

2016-11-14 21:52:43 UTC [54042] LOG: switching to external sort with
502 tapes: CPU: user: 181.08 s, system: 7.66 s, elapsed: 189.05 s
2016-11-14 23:06:06 UTC [54042] LOG: finished writing run 17 to tape
16: CPU: user: 4381.56 s, system: 161.82 s, elapsed: 4591.63 s
2016-11-14 23:06:06 UTC [54042] LOG: using 8388158 KB of memory for
read buffers among 17 input tapes
2016-11-14 23:06:36 UTC [54042] LOG: performsort done (except 17-way
final merge): CPU: user: 4383.45 s, system: 165.32 s, elapsed: 4622.04
s
2016-11-14 23:54:00 UTC [54042] LOG: external sort ended, 6256002
disk blocks used: CPU: user: 7124.49 s, system: 182.16 s, elapsed:
7466.18 s

Roughly 9% faster. Run building seems to degrade very slowly as we
increase work_mem, but the final merge speeds up rather more quickly.
Intuitively that makes sense to me: if merging were faster
than quicksorting, we could just merge-sort all the time instead of
using quicksort for internal sorts. Also, we've got 29960 tapes now,
better than three orders of magnitude more than what we actually need.
At this work_mem setting, 501 tapes is enough to efficiently sort at
least 4TB of data and quite possibly a good bit more.
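
(Back-of-the-envelope, assuming quicksorted runs come out at roughly
work_mem in size: 501 runs * ~8GB per run ~= 4TB merged in a single
pass, which squares with that estimate.)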

So, committed 0001, with comment changes along the lines I proposed before.
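
For anyone who wants the shape of the change: 0001 caps the merge
order that tuplesort_merge_order() computes from the memory budget. A
rough sketch of the post-commit logic -- from memory, so treat it as
illustrative rather than a quote of the patch:

    #define MINORDER    6       /* minimum merge order */
    #define MAXORDER    500     /* maximum merge order */

    int
    tuplesort_merge_order(int64 allowedMem)
    {
        int     mOrder;

        /*
         * Each merge input needs a tape buffer plus prereading
         * workspace; the output tape needs only its buffer.
         */
        mOrder = (allowedMem - TAPE_BUFFER_OVERHEAD) /
            (MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);

        /* Respect the floor as before, and the new ceiling. */
        mOrder = Max(mOrder, MINORDER);
        mOrder = Min(mOrder, MAXORDER);

        return mOrder;
    }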

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
