Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-02-03 22:58:11
Message-ID: CAH2-WzkWGGQp7sCW-DuCz-XJSyON=d3ZONpDq6OC=Jt2eS2rCQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> 1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
> show speed-up relative to that time):
>
> mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
> -----+--------+-------+-------+-------+-------+-------+-------+-------+-------
> 1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
> 64 | 19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
> 256 | 18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
> 512 | 17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x
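(A quick note on reading the quoted table: each speed-up figure divides the w = 0 wall-clock time by the corresponding parallel time, so the raw parallel runtimes can be recovered by division. A minimal Python illustration, using the mem = 1MB row of the 'asc' table above:)

```python
# Recover approximate raw parallel runtimes from the speed-up ratios
# reported in the table (speedup = serial_time / parallel_time).
serial_time = 119.97  # seconds at mem = 1MB, w = 0
speedups = {1: 4.61, 2: 4.83, 3: 5.32, 4: 5.61,
            5: 5.88, 6: 6.10, 7: 6.18, 8: 6.09}

for workers, speedup in sorted(speedups.items()):
    parallel_time = serial_time / speedup
    print(f"w = {workers}: ~{parallel_time:.1f}s")
```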

I think that this presorted case doesn't improve much because the
sorting itself is so cheap, as explained in my last mail. However, the
improvement as workers are added is still smaller than expected. I
think that this indicates that there isn't enough I/O capacity
available here to truly show the full potential of the patch -- I've
certainly seen better scalability for cases like this when there is a
lot of I/O bandwidth available, and I/O parallelism is there to be
taken advantage of; say, on a system with a large RAID array (I used
a RAID0 array with 12 HDDs for my own tests). Another issue is that
you probably don't have enough data here to really show off the
patch. I don't want to dismiss the benchmark, which is still quite
informative, but it's worth pointing out that the feature is going to
be most compelling for very large indexes, which will take at least
several minutes to build under any circumstances. (On the other hand,
having a reproducible case is also important, and that is something
your benchmark does have going for it.)

I suspect that this system isn't particularly well balanced for the
task of benchmarking the patch. You would probably see notably better
scalability than in any test shown so far if you could add sequential
I/O bandwidth, which is probably an economical, practical choice for
many users. I suspect that you aren't actually saturating the
available CPUs to the greatest extent that the implementation makes
possible.

Another thing I want to point out is that with 1MB of
maintenance_work_mem, the patch appears to do very well, but that
isn't terribly meaningful. I would suggest that we avoid testing this
patch with such a low amount of memory -- it doesn't seem important.
This is skewed by the fact that you're using replacement selection in
the serial case only. I think what this actually demonstrates is that
replacement selection is very slow, even with its putative best case.
I believe that commit 2459833 was the final nail in the coffin of
replacement selection. I certainly don't want to relitigate the
discussion on replacement_sort_tuples, and am not going to push too
hard, but ISTM that we should fully remove replacement selection from
tuplesort.c and be done with it.
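(For anyone rerunning these tests: replacement selection can be turned off in the serial case via the replacement_sort_tuples GUC, which makes the low-memory serial numbers comparable to the parallel ones. A hypothetical psql session fragment -- the table and index names are placeholders, not from the benchmark:)

```sql
-- Time a serial, low-memory index build with replacement selection
-- disabled, so the serial case uses the quicksort-based external sort.
-- (Setting replacement_sort_tuples = 0 disables replacement selection.)
\timing on
SET replacement_sort_tuples = 0;
SET maintenance_work_mem = '1MB';
CREATE INDEX test_idx ON test_table (a);  -- placeholder names
```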

--
Peter Geoghegan
