Re: Using quicksort for every external sort run

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Using quicksort for every external sort run
Date: 2016-03-24 03:05:54
Message-ID: d6c65ebc-1f82-e44b-97ce-52248ee0b75a@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 03/24/2016 03:00 AM, Peter Geoghegan wrote:
> On Tue, Mar 22, 2016 at 3:35 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> I've tested the patch you've sent on 2016/3/11, which I believe is the last
>> version. I haven't tuned the replacement_sort_mem at all. because
>> my understanding was that it's not a 9.6 material (per your
>> message). So my intent was to test the configuration people are
>> likely to use by default.
>
> I meant that using replacement selection in a special way with
> CREATE INDEX was not 9.6 material. But replacement_sort_mem is. And
> so, any case with the (maintenance)_work_mem <= 16MB will have used a
> heap for the first run.

FWIW, maintenance_work_mem was set to 1GB on the i5 machine and 256MB on
the Xeon. Hmm, maybe that's why we see no difference for CREATE INDEX on
the i5, and an improvement on the Xeon.

>
> I'm sorry I did not make a point of telling you this. It's my fault.
> The result in any case is that pre-sorted cases will be similar with
> and without the patch, since replacement selection can thereby make
> one long run. But on non-sorted cases, the patch helps less because
> it is in use less -- with not so much data overall, possibly much
> less (which I think explains why the 1M row tests seem so much less
> interesting than the 10M row tests).

Not a big deal - it's easy enough to change the config and repeat the
benchmark. Are there any particular replacement_sort_mem values that you
think would be interesting to configure?

I have to admit I'm a bit afraid we'll introduce a new GUC that only
very few users will know how to set properly, and so most people will
run with the default value or set it to something stupid.

>
> I worry that at the low end, replacement_sort_mem makes the patch
> have one long run, but still some more other runs, so merging is
> unbalanced. We should consider if the patch can beat the master
> branch at the low end without using a replacement selection heap. It
> would do better in at least some cases in low memory conditions,
> possibly a convincing majority of cases. I had hoped that my recent
> idea (since committed) of resetting memory contexts would help a lot
> with regressions when work_mem is very low, and that particular
> theory isn't really tested here.

Are you saying none of the queries triggers the memory context resets?
What queries would trigger that (to test the theory)?

>
>> I'm not sure which commit are you referring to. The benchmark was
>> done on a414d96a (from 2016/3/10). However I'd expect that to
>> affect both sets of measurements, although it's possible that it
>> affects the patched version differently.
>
> You did test the right patches. It just so happens that the master
> branch now has the memory batching stuff now, so it doesn't get
> credited with that. I think this is good, though, because we care
> about 9.5 -> 9.6 regressions.

So there's a commit in master (but not in 9.5), adding memory batching,
but it got committed before a414d96a so the benchmark does not measure
it's impact (with respect to 9.5). Correct?

But if we care about 9.5 -> 9.6 regressions, then perhaps we should
include that commit into the benchmark, because that's what the users
will see? Or have I misunderstood the second part?

BTW which patch does the memory batching? A quick search through git log
did not return any recent patches mentioning these terms.

> Improvement ratio (master time/patched time) for Xeon 10 million row
> case "SELECT * FROM int_test_padding ORDER BY a DESC":
>
> For work_mem of 8MB = 0.83, 32MB = 0.62, 128MB = 0.52, 512MB = 0.47,
> 1024MB = 1.00
>
> So, it gets faster than the master branch as more memory is
> available, but then it goes to 1.00 -- a perfect draw. I think that
> this happened simply because at that point, the sort was an internal
> sort (even though similar CREATE INDEX case did not go internal at
> the same point). The (internal) 1024MB case is not that much faster
> than the 512MB external case, which is pretty good.

Indeed.

>
> There are also "near draws", where the ratio is 0.98 or so. I think
> that this is because abbreviation is aborted, which can be a problem
> with synthetic data + text -- you get a very slow sort either way,

That is possible, yes. It's true that the worst regressions are on text,
although there are a few on numeric too (albeit not as significant).

> where most time is spent calling strcoll(), and cache
> characteristics matter much less. Those cases seemingly take much
> longer overall, so this theory makes sense. Unfortunately,
> abbreviated keys for text that is not C locale text was basically
> disabled across the board today due to a glibc problem. :-(

Yeah. Bummer :-(

>
> Whenever I see that the patch is exactly as fast as the master
> branch, I am skeptical. I am particularly skeptical of all i5
> results (including 10M cases), because the patch seems to be almost
> perfectly matched to the master branch for CREATE INDEX cases (which
> are the best cases for the patch on your Xeon server) -- it's much
> easier to believe that there was a problem during the test, honestly,
> like maintenance_work_mem wasn't set correctly. Those two things are

As I mentioned above, I haven't realized work_mem does not matter for
CREATE INDEX, and maintenance_work_mem was set to a fixed value for the
whole test. And the two machines used different values for this
particular configuration value - Xeon used just 256MB, while i5 used
1GB. So while on i5 it was just a single chunk, on Xeon there were
multiple batches. Hence the different behavior.

> so different that I have a hard time imagining that they'd ever
> really draw. I mean, it's possible, but it's more likely to be a
> problem with testing. And, queries like "SELECT * FROM
> int_test_padding ORDER BY a DESC" return all rows, which adds noise
> from all the client overhead. In fact, you often see that adding more

No it doesn't add overhead. The script actually does

COPY (query) TO '/dev/null'

on the server for all queries (except for the CREATE INDEX, obviously),
so there should be pretty much no overhead due to transferring rows to
the client and so on.

> memory helps no case here, so it seem a bit pointless. Maybe they
> should be written like "SELECT * FROM (select * from int_test_padding
> ORDER BY a DESC OFFSET 1e10) ff" instead. And maybe queries like
> "SELECT DISTINCT a FROM int_test ORDER BY a" would be better as
> "SELECT COUNT(DISTINCT a) FROM int_test", in order to test the
> datum/aggregate case. Just suggestions.

I believe the 'copy to /dev/null' achieves the same thing.

>
> If you really wanted to make the patch look good, a sort with 5GB of
> work_mem is the best way, FWIW. The heap data structure used by the
> master branch tuplesort.c will handle that very badly. You use no
> temp_tablespaces here. I wonder if the patch would do better with
> that. Sorting can actually be quite I/O bound with the patch
> sometimes, where it's usually CPU/memory bound with the heap,
> especially with lots of work_mem. More importantly, it would be more
> informative if the temp_tablespace was not affected by I/O from
> Postgres' heap.

I'll consider testing that. However, I don't think there was any
significant I/O on the machines - particularly not on the Xeon, which
has 16GB of RAM. So the temp files should fit into that quite easily.

The i5 machine has only 8GB of RAM, but it has 6 SSD drives in raid0. So
I doubt it was I/O bound.

>
> I also like seeing a sample of "trace_sort = on" output. I don't
> expect you to carefully collect that in every case, but it can tell
> us a lot about what's really going on when benchmarking.

Sure, I can collect that.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2016-03-24 03:28:31 Re: Using quicksort for every external sort run
Previous Message Robert Haas 2016-03-24 02:44:38 Re: Relation extension scalability