Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-02-04 00:15:13
Message-ID: CAEepm=1iGXvKo--UkktA1JM3EAfk9EyBEdNFWXpLKCY3HDwSzw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 4, 2017 at 11:58 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> 1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
>> show speed-up relative to that time):
>>
>> mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
>> -----+--------+-------+-------+-------+-------+-------+-------+-------+-------
>> 1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
>> 64 | 19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
>> 256 | 18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
>> 512 | 17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x
>
> I think that this presorted case doesn't improve much because the
> sorting itself is so cheap, as explained in my last mail. However, the
> improvements as workers are added are still smaller than expected. I
> think that this indicates that there isn't enough I/O capacity
> available here to truly show the full potential of the patch -- I've
> certainly seen better scalability for cases like this when there is a
> lot of I/O bandwidth available, and I/O parallelism is there to be
> taken advantage of. Say, when using a system with a large RAID array
> (I used a RAID0 array with 12 HDDs for my own tests). Another issue is
> that you probably don't have enough data here to really show off the
> patch. I don't want to dismiss the benchmark, which is still quite
> informative, but it's worth pointing out that the feature is going to
> be most compelling for very large indexes, that will take at least
> several minutes to build under any circumstances. (On the other
> hand, having a reproducible case is also important, and that's
> something this benchmark has going for it.)

Right. My main reason for starting smallish was to allow me to search
a space with several variables without waiting eons. Next I would
like to run a small subset of those tests with, say, 10, 20 or even
100 times more data loaded, so the tables would be ~20GB, ~40GB or
~200GB.

About read bandwidth: it shouldn't have been touching the disk at all
for reads. I did a dummy run of the index build before the measured
runs, so a 2GB table sorted in ~2 minutes would certainly have been
read entirely from the OS page cache, since the machine has oodles of
RAM.

About write bandwidth: the WAL, the index and the temp files all went
to an SSD array, though I don't have its characteristics to hand. I
should also be able to test on a multi-spindle HDD array. I doubt
either can touch your 12-way RAID0 array, but I will look into that.

> I suspect that this system isn't particularly well balanced for the
> task of benchmarking the patch. You would probably see notably better
> scalability than any you've shown in any test if you could add
> additional sequential I/O bandwidth, which is probably an economical,
> practical choice for many users. I suspect that you aren't actually
> saturating available CPUs to the greatest extent that the
> implementation makes possible.

I will look into what I/O options I can access before running larger
tests. I will also look into running the tests with both cold and
warm caches (i.e. "echo 1 > /proc/sys/vm/drop_caches") so that read
bandwidth enters the picture.

> Another thing I want to point out is that with 1MB of
> maintenance_work_mem, the patch appears to do very well, but that
> isn't terribly meaningful. I would suggest that we avoid testing this
> patch with such a low amount of memory -- it doesn't seem important.
> This is skewed by the fact that you're using replacement selection in
> the serial case only. I think what this actually demonstrates is that
> replacement selection is very slow, even with its putative best case.
> I believe that commit 2459833 was the final nail in the coffin of
> replacement selection. I certainly don't want to relitigate the
> discussion on replacement_sort_tuples, and am not going to push too
> hard, but ISTM that we should fully remove replacement selection from
> tuplesort.c and be done with it.

Interesting. I haven't grokked this but will go and read about it.
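For anyone else following along, here is a toy sketch of classic
replacement selection (my own illustration, not tuplesort.c's
implementation): a min-heap of bounded size emits sorted runs, and any
input value smaller than the one just written is tagged for the next
run. On pre-sorted input it produces a single run, which is the "best
case" being discussed above.

```python
import heapq
from itertools import islice

def replacement_selection(items, memory_size):
    """Generate sorted runs using a min-heap of at most memory_size
    entries.  Heap entries are (run_number, value) tuples, so values
    too small to extend the current run sort after everything that
    still belongs to it."""
    it = iter(items)
    heap = [(0, v) for v in islice(it, memory_size)]
    heapq.heapify(heap)
    runs, current, current_run = [], [], 0
    while heap:
        run, v = heapq.heappop(heap)
        if run != current_run:      # current run exhausted; start the next
            runs.append(current)
            current, current_run = [], run
        current.append(v)
        nxt = next(it, None)
        if nxt is not None:
            # A replacement smaller than the value just written cannot
            # extend this run; defer it to the following run.
            heapq.heappush(heap, (run if nxt >= v else run + 1, nxt))
    if current:
        runs.append(current)
    return runs
```

With pre-sorted input this yields one run regardless of memory size;
with reversed input it degrades to runs exactly the size of the heap.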

Based on your earlier comments about banana skin effects, I'm
wondering if it would be interesting to add a couple more heap
distributions to the test set that are almost completely sorted except
for a few entries out of order.
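A hypothetical generator for such a distribution (the name, parameters
and swap scheme here are mine, not anything from the existing test
set): an ascending sequence with a fixed number of random
transpositions applied, leaving a handful of out-of-place entries.

```python
import random

def almost_sorted(n, n_swaps, seed=42):
    """Return the integers 0..n-1 in ascending order except for
    n_swaps random transpositions -- a mostly-sorted sequence with a
    few 'banana skin' entries out of order.  Seeded for
    reproducibility, since reproducible cases matter here."""
    rng = random.Random(seed)
    data = list(range(n))
    for i in rng.sample(range(n), n_swaps):
        j = rng.randrange(n)
        data[i], data[j] = data[j], data[i]
    return data
```

Loading the output of something like this into a test table would
exercise the case where the data is nearly, but not quite, pre-sorted.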

--
Thomas Munro
http://www.enterprisedb.com
