Re: Using quicksort for every external sort run

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Using quicksort for every external sort run
Date: 2016-03-24 02:00:52
Message-ID: CAM3SWZQSodZDXmnhxHwo2rgSZmAkcKyBm830EaQJDq6G-02YbQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 22, 2016 at 3:35 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> I've tested the patch you've sent on 2016/3/11, which I believe is the last
> version. I haven't tuned the replacement_sort_mem at all. because my
> understanding was that it's not a 9.6 material (per your message). So my
> intent was to test the configuration people are likely to use by default.

I meant that using replacement selection in a special way with CREATE
INDEX was not 9.6 material. But replacement_sort_mem is. And so, any
case with the (maintenance)_work_mem <= 16MB will have used a heap for
the first run.

I'm sorry I did not make a point of telling you this. It's my fault.
The result in any case is that pre-sorted cases will be similar with
and without the patch, since replacement selection can thereby make
one long run. But on non-sorted cases, the patch helps less because it
is in use less -- with not so much data overall, possibly much less
(which I think explains why the 1M row tests seem so much less
interesting than the 10M row tests).

I worry that at the low end, replacement_sort_mem makes the patch have
one long run, but still some more other runs, so merging is
unbalanced. We should consider if the patch can beat the master branch
at the low end without using a replacement selection heap. It would do
better in at least some cases in low memory conditions, possibly a
convincing majority of cases. I had hoped that my recent idea (since
committed) of resetting memory contexts would help a lot with
regressions when work_mem is very low, and that particular theory
isn't really tested here.

> I'm not sure which commit are you referring to. The benchmark was done on
> a414d96a (from 2016/3/10). However I'd expect that to affect both sets of
> measurements, although it's possible that it affects the patched version
> differently.

You did test the right patches. It just so happens that the master
branch now has the memory batching stuff now, so it doesn't get
credited with that. I think this is good, though, because we care
about 9.5 -> 9.6 regressions.

Improvement ratio (master time/patched time) for Xeon 10 million row
case "SELECT * FROM int_test_padding ORDER BY a DESC":

For work_mem of 8MB = 0.83, 32MB = 0.62, 128MB = 0.52, 512MB = 0.47,
1024MB = 1.00

So, it gets faster than the master branch as more memory is available,
but then it goes to 1.00 -- a perfect draw. I think that this happened
simply because at that point, the sort was an internal sort (even
though similar CREATE INDEX case did not go internal at the same
point). The (internal) 1024MB case is not that much faster than the
512MB external case, which is pretty good.

There are also "near draws", where the ratio is 0.98 or so. I think
that this is because abbreviation is aborted, which can be a problem
with synthetic data + text -- you get a very slow sort either way,
where most time is spent calling strcoll(), and cache characteristics
matter much less. Those cases seemingly take much longer overall, so
this theory makes sense. Unfortunately, abbreviated keys for text that
is not C locale text was basically disabled across the board today due
to a glibc problem. :-(

Whenever I see that the patch is exactly as fast as the master branch,
I am skeptical. I am particularly skeptical of all i5 results
(including 10M cases), because the patch seems to be almost perfectly
matched to the master branch for CREATE INDEX cases (which are the
best cases for the patch on your Xeon server) -- it's much easier to
believe that there was a problem during the test, honestly, like
maintenance_work_mem wasn't set correctly. Those two things are so
different that I have a hard time imagining that they'd ever really
draw. I mean, it's possible, but it's more likely to be a problem with
testing. And, queries like "SELECT * FROM int_test_padding ORDER BY a
DESC" return all rows, which adds noise from all the client overhead.
In fact, you often see that adding more memory helps no case here, so
it seem a bit pointless. Maybe they should be written like "SELECT *
FROM (select * from int_test_padding ORDER BY a DESC OFFSET 1e10) ff"
instead. And maybe queries like "SELECT DISTINCT a FROM int_test ORDER
BY a" would be better as "SELECT COUNT(DISTINCT a) FROM int_test", in
order to test the datum/aggregate case. Just suggestions.

If you really wanted to make the patch look good, a sort with 5GB of
work_mem is the best way, FWIW. The heap data structure used by the
master branch tuplesort.c will handle that very badly. You use no
temp_tablespaces here. I wonder if the patch would do better with
that. Sorting can actually be quite I/O bound with the patch
sometimes, where it's usually CPU/memory bound with the heap,
especially with lots of work_mem. More importantly, it would be more
informative if the temp_tablespace was not affected by I/O from
Postgres' heap.

I also like seeing a sample of "trace_sort = on" output. I don't
expect you to carefully collect that in every case, but it can tell us
a lot about what's really going on when benchmarking.

Thanks
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2016-03-24 02:01:44 Re: Fix for OpenSSL error queue bug
Previous Message Amit Kapila 2016-03-24 01:43:56 Re: Relation extension scalability