Re: Using quicksort for every external sort run

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Using quicksort for every external sort run
Date: 2016-04-02 14:31:45
Message-ID: 7142bf2a-9460-b549-61f3-8cb32a3209f9@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On 03/30/2016 04:53 AM, Peter Geoghegan wrote:
> On Tue, Mar 29, 2016 at 6:02 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
...
>>> If there is ever a regression, it is only really sensible to talk
>>> about it while looking at trace_sort output (and, I guess, the query
>>> plan). I've asked Tomas for trace_sort output in all relevant cases.
>>> There is no point in "flying blind" and speculating what the problem
>>> was from a distance.
>>
>>
>> The updated benchmarks are currently running. I'm out of office until
>> Friday, and I'd like to process the results over the weekend. FWIW I'll have
>> results for these cases:
>>
>> 1) unpatched (a414d96a)
>> 2) patched, default settings
>> 3) patched, replacement_sort_mem=64
>>
>> Also, I'll have trace_sort=on output for all the queries, so we can
>> investigate further.
>
> Thanks! That will tell us a lot more.

So, I do have the results from both machines - I've attached the basic
comparison spreadsheets, and the complete summary is available here:

https://github.com/tvondra/sort-benchmark

The database logs also include trace_sort=on output for each query (use
the timestamp recorded for each query in the spreadsheet to locate the
right section of the log).
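
For reference, the trace_sort output was collected simply by enabling
the GUC before each query, roughly like this (a minimal sketch; the LOG
lines end up in the server log):

  SET trace_sort = on;             -- emit tuplesort progress as LOG messages
  SET client_min_messages = log;   -- optionally echo them to the session too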

The benchmark was slightly modified, based on the previous feedback:

* fix the maintenance_work_mem thinko (affects CREATE INDEX cases)

* use "SELECT * FROM (... OFFSET 1e10)" pattern instead of the original
approach (copy to /dev/null)

* change the data generation for "low cardinality" data sets (by
mistake the original generated mostly the same data as the "high
cardinality" sets)
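
To make the timing pattern concrete, here is a minimal sketch of what
each timed query looks like (table and column names are hypothetical,
not the actual benchmark schema):

  -- The huge OFFSET makes the server perform the full sort but discard
  -- every row, so the timing is dominated by the sort itself rather
  -- than by transferring the result to the client.
  SELECT * FROM (
    SELECT a, b, padding FROM test_data ORDER BY a OFFSET 1e10
  ) discard;

  -- For the CREATE INDEX cases it's maintenance_work_mem (not work_mem)
  -- that determines the sort memory - that's the thinko fixed above.
  SET maintenance_work_mem = '32MB';
  CREATE INDEX ON test_data (a);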

I have not collected explain plans. I guess we'll need EXPLAIN ANALYZE
in most cases anyway, and collecting those would increase the duration
of the benchmark. So I plan to collect this info for the interesting
cases on request.
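
When requested, those would be collected along these lines (a sketch,
using the same hypothetical query shape as above):

  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM (
    SELECT a, b, padding FROM test_data ORDER BY a OFFSET 1e10
  ) discard;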

While it might look like I'm somehow opposed to this patch series,
that's mostly because we tend to look only at the few cases that behave
poorly.

So let me be clear: I do think the patch seems to be a significant
performance improvement for most of the queries, and I'm OK with
accepting a few regressions (particularly if we agree those are
pathological cases, unlikely to happen in real-world workloads).

It's quite rare that a patch is a universal win without regressions, so
it's important to consider how likely those regressions are and what the
net effect of the patch is - and the patch seems to be a significant
improvement in most cases (with regressions limited to pathological or
rare corner cases).

I don't think those are reasons not to push this into 9.6.

What follows is a rudimentary analysis of the results, and a bit about
how the benchmark was constructed (and its representativeness).

rudimentary analysis
--------------------

I haven't done any thorough investigation of the results yet, but in
general it seems the results from both machines are quite similar - the
numbers are different, but the speedup/slowdown patterns are mostly the
same (with some exceptions that I'd guess are due to HW differences).

The slowdown/speedup patterns (red/green cells in the spreadsheets) are
also similar to those collected originally. Some timings are much lower,
presumably thanks to using the "OFFSET 1e10" pattern, but the patterns
are the same. CREATE INDEX statements are an obvious exception, of
course, due to the thinko in the previous benchmark.

The one thing that surprised me a bit is that

replacement_sort_mem=64

often made the results considerably worse. A common pattern is that the
slowdown "spreads" to nearby cells - there are many queries where the
8MB case is 1:1 with master and 32MB is 1.5:1 (i.e. takes 1.5x more
time), and setting replacement_sort_mem=64 then also slows down the 8MB
case.

In general, replacement_sort_mem=64 seems to only affect the 8MB case,
and in most cases it results in a 100% slowdown (i.e. queries taking 2x
as long).
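
For clarity, this is how the third configuration differs from the
second - a sketch, assuming the semantics of the replacement_sort_mem
GUC added by the patch (value in kB; replacement selection is only used
when work_mem is below it):

  SET work_mem = '8MB';
  SET replacement_sort_mem = 64;  -- i.e. 64kB, low enough to effectively
                                  -- disable replacement selection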

That being said, I do think the results are quite impressive - there
are far more queries with significant speedups (usually ~2x or more)
than slowdowns, and the slowdowns are less significant than the
speedups.

I mostly agree with Peter that we probably don't need to worry about the
slowdown cases with low work_mem settings - if you do sorts with
millions of rows, you really need to give the database enough RAM.

But there are multiple slowdown cases with work_mem=128MB, and I'd dare
to say 128MB is not exactly a low-end work_mem value. So perhaps we
should at least look at those cases.

It's also interesting that setting replacement_sort_mem=64 makes this
much worse - i.e. the number of slowdowns with higher work_mem values
increases, and the difference is often quite large.

So I'm really not sure what to do with this GUC ...

L2/L3 cache
-----------

I think we're overly optimistic when it comes to the size of the CPU
cache - while it's certainly true that modern CPUs have quite a bit of
it (modern Xeon E5 CPUs have up to ~45MB per socket), there are two
important factors here:

1) The cache is shared by all cores on the socket (on average that's
only ~2-3 MB per physical core, e.g. 45MB split across 18 cores is
2.5MB), and thus by all processes running on the CPU. It's possible to
run a single process on the CPU (thus getting all the cache), but that
amounts to a rather expensive single-core CPU.

2) The cache is shared by all nodes in the query plan, and the executor
interleaves execution of the nodes (so while an implementation of a node
may be very efficient when executed in isolation, that may not be true
when executed as part of a larger plan). The sort may be immune to this
to some degree, though.

I'm not sure how much of this was considered in the 1994 VLDB paper,
but I'd be very careful about making claims about how much CPU cache is
available today (even on the best server CPUs).

benchmark discussion
--------------------

1) representativeness

Let me explain how I constructed the benchmark - I simply compiled a
list of queries executing sorts, and ran them on synthetic datasets with
different characteristics (cardinality and initial ordering). And I've
done that with different work_mem values, to see how that affects the
behavior.

I've done it this way for a few reasons - firstly, I'm extremely lazy
and did not want to study the internals of the patch, as I'm not that
much into sorting details. Secondly, I did not want to tailor the
benchmark too tightly to the patch - it's quite possible some of the
queries don't execute the modified code at all, in which case they
should be unaffected (no slowdown, no speedup).

So while the benchmark might certainly include additional queries or
data sets with different characteristics, I'd dare to claim it's not
entirely misguided.

Some of the tested combinations may certainly be seen as implausible or
pathological, although none of them were constructed that way on
purpose. I'm perfectly fine with identifying such cases and ignoring
them.

2) TOAST overhead

Peter also mentioned that some of the cases have quite a bit of
padding, and that the TOAST overhead distorts the results. It's true
there's quite a bit of padding (~320B), but I don't quite see why that
would make the results bogus - I've intentionally constructed the data
like this to see how the sort behaves with wide rows, because:

* many BI queries actually fetch quite a lot of columns, and while 320B
may seem a bit high, it's not that difficult to reach with a few NUMERIC
columns

* we're getting parallel aggregate in 9.6, which relies on serializing
the aggregate state (and the combine phase may then need to do a sort again)
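
To illustrate, a data set of roughly this shape can be generated along
these lines (a hypothetical sketch, not the actual benchmark generator;
all names are made up):

  -- ~320B of padding per row: md5() yields 32 characters, repeated 10x
  CREATE TABLE test_data AS
  SELECT (random() * 1000000)::int AS sort_key,
         repeat(md5(i::text), 10)  AS padding
  FROM generate_series(1, 1000000) s(i);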

Moreover, while there certainly is TOAST overhead, I don't quite see
why it should change with the patch (as the padding columns are not used
as a sort key). Perhaps the patch results in "moving the tuples around
more" (de-emphasizing comparison costs), but I don't see why that
shouldn't be an important metric in general - memory bandwidth seems to
be quite an important bottleneck these days. Of course, if this only
affects the pathological cases, we may ignore it.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
i5.ods application/vnd.oasis.opendocument.spreadsheet 298.5 KB
xeon.ods application/vnd.oasis.opendocument.spreadsheet 312.5 KB
