Re: Using quicksort for every external sort run

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Using quicksort for every external sort run
Date: 2016-04-02 23:09:51
Message-ID: CAM3SWZQzNDrAq_LM2NpuLk_hAkd5TnRGSAsL0WNnt63L9+XH7Q@mail.gmail.com
Lists: pgsql-hackers

On Sat, Apr 2, 2016 at 7:31 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> So, I do have the results from both machines - I've attached the basic
> comparison spreadsheets, the complete summary is available here:
>
> https://github.com/tvondra/sort-benchmark
>
> The database log also includes the logs for trace_sort=on for each query
> (use the timestamp logged for each query in the spreadsheet to locate the
> right section of the log).

Thanks!

Each row in these spreadsheets shows what looks like a multimodal
distribution for the patch (if you focus on the actual run times, not
the ratios). IOW, you can clearly see the regressions are only where
master has its best case, and the patch its worst case. As work_mem
increases for each benchmark case, by far the largest improvement for
the patch is usually seen as we cross the CPU cache threshold. Master
gets noticeably slower as work_mem goes from 8MB to 32MB, but the
patch gets far, far faster. Things continue to improve for the patch
in absolute terms, and especially relative to master,
following further increases in work_mem, but not nearly as
dramatically as that first increment (unless we have lots of padding,
which makes the memtuples heap itself much smaller, so it happens one
step later). Master shows a slow decline at and past 32MB of work_mem.
If the test hardware had a larger L3 cache, we might expect to notice
a second big drop, but this hardware doesn't have the enormous L3
cache sizes of new Xeon processors (e.g. 32MB, 45MB).

> While it might look like I'm somehow opposed to this patch series, that's
> mostly because we tend to look only at the few cases that behave poorly.
>
> So let me be clear: I do think the patch seems to be a significant
> performance improvement for most of the queries, and I'm OK with accepting a
> few regressions (particularly if we agree those are pathological cases,
> unlikely to happen in real-world workloads).
>
> It's quite rare that a patch is a universal win without regressions, so it's
> important to consider how likely those regressions are and what's the net
> effect of the patch - and the patch seems to be a significant improvement in
> most cases (and regressions limited to pathological or rare corner cases).
>
> I don't think those are reasons not to push this into 9.6.

I didn't think that you opposed the patch. In fact, you did the right
thing by focussing on the low-end regressions, as I've said. I was
probably too concerned about Robert failing to consider that they were
not representative, particularly with regard to how small the
memtuples heap could be relative to the CPU cache; blame it on how
close I've become to this problem. I'm pretty confident that Robert
can be convinced that these regressions do not matter enough to block
committing the patch. In any case, I'm pretty confident that I cannot
fix any remaining regressions.

> I haven't done any thorough investigation of the results yet, but in general
> it seems the results from both machines are quite similar - the numbers are
> different, but the speedup/slowdown patterns are mostly the same (with some
> exceptions that I'd guess are due to HW differences).

I agree. What we clearly see is the advantage of quicksort being
cache oblivious, especially relative to master's use of a heap. That
advantage becomes pronounced at slightly different points in each
case, but the overall picture is the same. This pattern demonstrates
why a cache-oblivious algorithm is so useful in general -- we don't
have to care about tuning for that. As important as this is for serial
sorts, it's even more important for parallel sorts, where parallel
workers compete for memory bandwidth, and where it's practically
impossible to build a cost model for CPU cache size + memory use +
nworkers.

> The slowdown/speedup patterns (red/green cells in the spreadsheets) are also
> similar to those collected originally. Some timings are much lower,
> presumably thanks to using the "OFFSET 1e10" pattern, but the patterns are
> the same.

I think it's notable that this made things more predictable, and made
the benefits clearer.
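
For anyone who wants to reproduce this, the pattern looks roughly like
the sketch below; table and column names are placeholders, not Tomas'
actual schema. The huge OFFSET forces the sort to run to completion
while discarding all output, so client transfer time stays out of the
measurement:

    SET trace_sort = on;    -- log details of each sort
    SET work_mem = '32MB';  -- swept across the tested values in the benchmark
    SELECT * FROM sort_test ORDER BY sort_key OFFSET 1e10;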

> The one thing that surprised me a bit is that
>
> replacement_sort_mem=64
>
> actually made the results considerably worse in many cases. A common
> pattern is that the slowdown "spreads" to nearby cells - there are many
> queries where the 8MB case is 1:1 with master and 32MB is 1.5:1 (i.e. takes
> 1.5x as long), and setting replacement_sort_mem=64 just slows down the 8MB
> case.
>
> In general, replacement_sort_mem=64 seems to only affect the 8MB case, and
> in most cases it results in a 100% slowdown (i.e. queries take 2x as long).

To be clear, for the benefit of other people: replacement_sort_mem=64
makes the patch never use a replacement selection heap, even at the
lowest tested work_mem setting of 8MB.

This is exactly what I expected. When replacement_sort_mem is the
proposed default of 16MB, it literally has zero impact on how the
patch behaves where work_mem > replacement_sort_mem. So, since the
only case where work_mem <= replacement_sort_mem is when work_mem =
8MB, that's the only case where any change can be seen in either
direction. I thought it was important to see that (but more so when we
have cheap hardware with little CPU cache).
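
Spelled out as a sketch (just the GUC settings, not anything from
Tomas' script; the comments restate the behaviour described above):

    SET work_mem = '8MB';
    SET replacement_sort_mem = '16MB';  -- proposed default: work_mem <= this,
                                        -- so the first run may still use a
                                        -- replacement selection heap
    SET replacement_sort_mem = 64;      -- 64kB: work_mem always exceeds this,
                                        -- so every run is quicksorted

At work_mem = 32MB and above the two settings behave identically,
because work_mem > replacement_sort_mem either way.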

> That being said, I do think the results are quite impressive - there are far
> more queries with significant speedups (usually ~2x or more) than with
> slowdowns (and the slowdowns are less significant than the speedups).
>
> I mostly agree with Peter that we probably don't need to worry about the
> slowdown cases with low work_mem settings - if you do sorts with millions of
> rows, you really need to give the database enough RAM.

Cool.

> But there are multiple slowdown cases with work_mem=128MB, and I'd dare to
> say 128MB is not quite a low-end work_mem value. So perhaps we should look at
> least at those cases.
>
> It's also interesting that setting replacement_sort_mem=64 makes this much
> worse - i.e. the number of slowdowns with higher work_mem values increases,
> and the difference is often quite huge.
>
> So I'm really not sure what to do with this GUC ...

I think it mostly depends on how systems that might actually need
replacement_sort_mem do with and without it -- that is, cases that
need it because work_mem=8MB is a generally reasonable setting, given
that low-end hardware is in use. That's why I asked Greg to use cheap
hardware at least once. A regression at work_mem=8MB matters more when
you have a CPU cache size of 1MB (and there is no competition for the
cache).

> L2/L3 cache
> -----------
>
> I think we're overly optimistic when it comes to the size of the CPU cache -
> while it's certainly true that modern CPUs have quite a bit of it (modern
> Xeon E5 CPUs have up to ~45MB per socket), there are two important factors
> here:

> I'm not sure how much this is considered in the 1994 VLDB paper, but I'd be
> very careful about making claims about how much CPU cache is available today
> (even on the best server CPUs).

I agree. That's why it's so important that we use CPU cache effectively.

> benchmark discussion
> --------------------
>
> 1) representativeness

> I've done it this way for a few reasons - firstly, I'm extremely lazy and
> did not want to study the internals of the patch, as I'm not that deep into
> the sorting details. Secondly, I did not want to tailor the benchmark too
> tightly to the patch - it's quite possible some of the queries are not
> executing the modified code at all, in which case they should be unaffected
> (no slowdown, no speedup).

That's right -- a couple of cases do not exercise the patch because
the sort is an internal sort. I think that this isn't too hard to
figure out now, though. I get why you did things this way. I
appreciate your help.
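
For anyone who wants to weed those cases out, EXPLAIN ANALYZE shows
which path a given sort took -- something along these lines, with the
actual numbers elided since they vary by machine (table and column
names are again placeholders):

    EXPLAIN (ANALYZE, COSTS OFF)
    SELECT * FROM sort_test ORDER BY sort_key OFFSET 1e10;
    ...
    Sort Method: quicksort  Memory: ...kB       <- internal sort; patch not exercised
    Sort Method: external merge  Disk: ...kB    <- external sort; patch exercised

The trace_sort output in the database log gives the same information
in more detail.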

> Some of the tested combinations may certainly be seen as implausible or
> pathological, although that was not intentional - they were not constructed
> on purpose. I'm
> perfectly fine with identifying such cases and ignoring them.

Me too. Or, if not ignoring them, only giving a very small weight to them.

> 2) TOAST overhead

> Moreover, while there certainly is TOAST overhead, I don't quite see why it
> should change with the patch (as the padding columns are not used as a sort
> key). Perhaps the patch results in "moving the tuples around more"
> (deemphasizing comparison), but I don't see why that shouldn't be an
> important metric in general - memory bandwidth seems to be quite an important
> bottleneck these days. Of course, if this only affects the pathological
> cases, we may ignore that.

That's fair. I probably shouldn't have mentioned TOAST at all --
what's actually important to keep in mind about padding cases, as
already mentioned, is that they can make the 32MB cases behave like
the 8MB cases. The memtuples heap is left relatively small for the
32MB case too, and so can remain cache resident. Replacement selection
therefore almost accidentally gets fewer heap cache misses for a
little longer, but it's still the same pattern. Cache misses come to
dominate a bit later.
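
To put rough numbers on that (purely illustrative; this assumes a
SortTuple slot of roughly 24 bytes on a 64-bit build and ignores
allocation overhead):

    wide, padded tuples (~1kB each), work_mem = 32MB:
        ~30K tuples in memory  ->  memtuples heap of ~30K * 24B ~= 0.7MB
        (small enough to stay cache resident)

    narrow tuples (~100 bytes each), work_mem = 32MB:
        ~250K tuples in memory ->  memtuples heap of ~6MB
        (too large for most L2/L3 caches, so heap maintenance misses)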

--
Peter Geoghegan
