Re: Using quicksort for every external sort run

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Greg S <stark(at)mit(dot)edu>
Subject: Re: Using quicksort for every external sort run
Date: 2016-01-30 07:25:38
Message-ID: CAM3SWZSpn1qoUqrqUgSW0uR2vVq3w+GPBrs4DDi47Q0b7hJcvA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 29, 2016 at 2:58 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I don't quite know what you mean by these numbers. Add a generic,
> conservative threshold to what?

I meant: use "quicksort with spillover" simply because an estimated
90%+ of all tuples have already been consumed, without considering
the tuple width, etc.
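
Concretely, I'm imagining something as simple as the sketch below.
(The names and the check itself are hypothetical and illustrative,
not the patch's actual cost model.)

/*
 * Rough sketch of a generic, conservative threshold: use "quicksort
 * with spillover" only because nearly all of the input already fit
 * in memory.
 */
static bool
use_quicksort_with_spillover(double tuples_in_memory,
                             double estimated_input_tuples)
{
    if (estimated_input_tuples <= 0.0)
        return false;           /* no usable estimate */

    /* an estimated 90%+ of all tuples have already been consumed */
    return tuples_in_memory / estimated_input_tuples >= 0.9;
}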

> Thinking about this some more, I really think we should think hard
> about going back to the strategy which you proposed and discarded in
> your original post: always generate the first run using replacement
> selection, and every subsequent run by quicksorting. In that post you
> mention powerful advantages of this method: "even without a strong
> logical/physical correlation, the algorithm tends to produce runs that
> are about twice the size of work_mem. (It's also notable that
> replacement selection only produces one run with mostly presorted
> input, even where input far exceeds work_mem, which is a neat trick.)"
> You went on to dismiss that strategy, saying that "despite these
> upsides, replacement selection is obsolete, and should usually be
> avoided." But I don't see that you've justified that statement.

Really? Just try it with a heap that is not tiny. Performance tanks.
The fact that replacement selection can produce one long run then
becomes a liability, not a strength. With a work_mem of something like
1GB, it's *extremely* painful.
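
For anyone who hasn't looked at it in a while, replacement selection's
run formation is essentially a heap loop; below is a minimal sketch
over plain ints (my own illustration, not tuplesort.c's code). With a
tiny heap this is fine, but once the heap dwarfs the CPU cache, every
sift_down walks a chain of cache-missing levels for every single
tuple, which is exactly where the 1GB case falls over.

/*
 * Illustrative replacement selection over ints.  The point is the
 * per-tuple sift_down: once the heap is far larger than CPU cache,
 * each of its ~log2(N) levels is a likely cache miss.
 */
#include <stdio.h>

typedef struct { int value; int run; } HeapEntry;

static HeapEntry heap[4];               /* tiny "work_mem" for the demo */
static int nheap;

static int
cmp(HeapEntry a, HeapEntry b)
{
    if (a.run != b.run)
        return a.run - b.run;           /* current-run entries sort first */
    return a.value - b.value;
}

static void
sift_down(int i)
{
    for (;;)
    {
        int l = 2 * i + 1, r = 2 * i + 2, s = i;

        if (l < nheap && cmp(heap[l], heap[s]) < 0) s = l;
        if (r < nheap && cmp(heap[r], heap[s]) < 0) s = r;
        if (s == i) break;
        HeapEntry t = heap[i]; heap[i] = heap[s]; heap[s] = t;
        i = s;
    }
}

int
main(void)
{
    int input[] = {5, 3, 9, 1, 8, 2, 7, 10, 4, 6};
    int n = 10, next = 0, currun = -1;

    while (nheap < 4 && next < n)       /* fill the heap from the input */
        heap[nheap++] = (HeapEntry) {input[next++], 0};
    for (int i = nheap / 2 - 1; i >= 0; i--)
        sift_down(i);

    while (nheap > 0)
    {
        HeapEntry top = heap[0];        /* emit the current minimum */

        if (top.run != currun)
            printf("\nrun %d:", currun = top.run);
        printf(" %d", top.value);

        if (next < n)
        {
            /* Replace the root: it stays in the current run if it can
             * still be emitted in order, else it waits for the next run. */
            heap[0].value = input[next];
            heap[0].run = (input[next] >= top.value) ? top.run : top.run + 1;
            next++;
        }
        else
            heap[0] = heap[--nheap];    /* drain the heap at end of input */
        sift_down(0);
    }
    printf("\n");       /* prints runs "1 3 5 7 8 9 10" and "2 4 6" */
    return 0;
}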

> It seems pretty easy to construct cases where this technique regresses,
> and a large percentage of those cases are precisely those where
> replacement selection would have produced a single run, avoiding the
> merge step altogether.

...*and* where many merge passes are otherwise required (if not, the
merge is still cheap enough to leave us ahead). That typically means
very small work_mem settings, like 4MB, combined with far larger data
volumes. It's easy to construct those cases, but that doesn't mean
they particularly matter. Using 4MB of work_mem to sort 10GB of data
is penny wise and pound foolish. The cases we've seen regress are
mostly a concern because misconfiguration happens.
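
To put rough numbers on that (back-of-the-envelope only; the merge
order below is an assumed value, not what tuplesort would compute):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  data_mb = 10 * 1024;            /* 10GB of input */
    double  work_mem_mb = 4;                /* 4MB work_mem */
    double  runs = data_mb / work_mem_mb;   /* ~2560 quicksorted runs */
    int     merge_order = 14;               /* input tapes; assumed */
    double  passes = ceil(log(runs) / log(merge_order));

    /* Prints "2560 runs, about 3 merge passes".  With a sane work_mem
     * the run count is low enough for a single merge pass, which is
     * why it's mostly the badly misconfigured cases that regress. */
    printf("%.0f runs, about %.0f merge passes\n", runs, passes);
    return 0;
}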

A compromise that may be acceptable is to always do a "quicksort with
spillover" when there is a very low work_mem setting (maybe less than
20MB) and the estimated number of input tuples is less than 10x what
we've seen so far. That would achieve the same thing.
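
In code terms, perhaps something along these lines (again only a
sketch, using the 20MB and 10x figures from the paragraph above
rather than a finished cost model; all names are hypothetical):

/*
 * Hypothetical compromise check: only use "quicksort with spillover"
 * under a very low work_mem, and only when the input estimate isn't
 * wildly ahead of what has actually been consumed so far.
 */
static bool
use_spillover_compromise(long work_mem_bytes,
                         double tuples_in_memory,
                         double estimated_input_tuples)
{
    if (work_mem_bytes >= 20L * 1024 * 1024)    /* "less than 20MB" */
        return false;
    if (estimated_input_tuples >= 10.0 * tuples_in_memory)
        return false;
    return true;
}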

> I'm quite willing to run somewhat more slowly than in other cases to
> be certain of not regressing the case of completely or
> almost-completely ordered input. Even if that didn't seem like a
> sufficient reason unto itself, I'd be willing to go that way just so
> we don't have to depend on a cost model that might easily go wrong due
> to bad input even if it were theoretically perfect in every other
> respect (which I'm pretty sure is not true here anyway).

The consequences of being wrong either way are not severe (note that
making one long run isn't a goal of the cost model currently).

> I also have another idea that might help squeeze more performance out
> of your approach and avoid regressions. Suppose that we add a new GUC
> with a name like sort_mem_stretch_multiplier or something like that,
> with a default value of 2.0 or 4.0 or whatever we think is reasonable.
> When we've written enough runs that a polyphase merge will be
> required, or when we're actually performing a polyphase merge, the
> amount of memory we're allowed to use increases by this multiple. The
> idea is: we hope that people will set work_mem appropriately and
> consequently won't experience polyphase merges at all, but it might
> happen anyway. However, it's almost certain not to happen very
> frequently.
> Therefore, using extra memory in such cases should be acceptable,
> because while you might have every backend in the system using 1 or
> more copies of work_mem for something if the system is very busy, it
> is extremely unlikely that you will have more than a handful of
> processes doing polyphase merges.

I'm not sure that that's practical. Currently, tuplesort decides on a
number of tapes ahead of time. When we're constrained by that, the
stretch multiplier would apply, but I think that could be invasive,
because the number of tapes ("merge order" + 1) would already have
been fixed as a function of the non-stretched work_mem.
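
For context, the merge order falls out of the memory budget around
the time the tape set is created, roughly as in the sketch below
(simplified, with illustrative constants; not tuplesort.c's exact
code). Because that decision is made up front, a merge-time stretch
multiplier wouldn't raise a merge order that was already fixed from
the un-stretched budget.

#define MINORDER        6               /* assumed floor */
#define TAPE_OVERHEAD   (24 * 1024)     /* per-tape buffering, roughly */
#define MERGE_BUFFER    (256 * 1024)    /* per-input-tape workspace */

static int
merge_order(long allowed_mem)
{
    int     morder;

    morder = (allowed_mem - TAPE_OVERHEAD) / (MERGE_BUFFER + TAPE_OVERHEAD);
    if (morder < MINORDER)
        morder = MINORDER;
    return morder;              /* number of tapes is merge order + 1 */
}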

--
Peter Geoghegan
