Re: Using quicksort for every external sort run

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Using quicksort for every external sort run
Date: 2015-11-18 23:29:50
Message-ID: CAM3SWZQS==-K+SpLsmPdzQ=nrXKMhyNDMELfJKWbpKx0a1+sdg@mail.gmail.com
Lists: pgsql-hackers

Hi Jeff,

On Wed, Nov 18, 2015 at 10:31 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> tuplesort.c: In function 'mergeruns':
> tuplesort.c:2741: warning: unused variable 'memNowUsed'

That was caused by a last-minute change to the multi-pass warning
message. I forgot to build at -O2, and missed this.

>> I believe, in general, that we should consider a multi-pass sort to be
>> a kind of inherently suspect thing these days, in the same way that
>> checkpoints occurring 5 seconds apart are: not actually abnormal, but
>> something that we should regard suspiciously. Can you really not
>> afford enough work_mem to only do one pass?
>
> I don't think it is really about the cost of RAM. What people can't
> afford is spending all of their time personally supervising all the
> sorts on the system. It is pretty easy for a transient excursion in
> workload to make a server swap itself to death and fall over. Not just
> the PostgreSQL server, but the entire OS. Since we can't let that
> happen, we have to be defensive about work_mem. Yes, we have far more
> RAM than we used to. We also have far more things demanding access to
> it at the same time.

I agree with you, but I'm not sure that I've been completely clear on
what I mean. Even as the demand on memory has grown, the competitive
advantage of replacement selection in avoiding a multi-pass merge has
diminished far faster. As a DBA, you should simply not let a
multi-pass merge happen -- that's the advice other systems'
documentation gives.
Avoiding a multi-pass merge was always the appeal of replacement
selection, even in the 1970s, but it will rarely if ever make that
critical difference these days.

As I said, as the memory available for sorting increases linearly, the
volume of input at which a multi-pass merge phase becomes necessary
increases quadratically with my patch. The advantage of replacement
selection is therefore almost irrelevant. That is why, in general,
interest in replacement selection is far far lower today than it was
in the past.
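
To put rough numbers on that, here is a back-of-envelope sketch (the
per-tape buffer size is an assumed constant for illustration, not the
figure tuplesort.c actually uses, and quicksorted runs are taken to be
about work_mem in size):

#include <stdio.h>

int
main(void)
{
    const long  tape_buffer = 32L * 1024;   /* assumed bytes per tape */

    for (long mb = 4; mb <= 256; mb *= 2)
    {
        long        work_mem = mb * 1024L * 1024;
        long        merge_order = work_mem / tape_buffer;   /* runs merged per pass */
        long long   one_pass = (long long) merge_order * work_mem; /* run ~ work_mem */

        printf("work_mem %4ldMB -> one merge pass covers ~%.1f GB\n",
               mb, (double) one_pass / (1024.0 * 1024 * 1024));
    }
    return 0;
}

Under those assumptions, doubling work_mem roughly quadruples the
volume of input that can still be merged in a single pass, because
both the run size and the number of runs merged at once grow with
work_mem.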

The poor CPU cache characteristics of the heap (priority queue) are
only half the story about why replacement selection is more or less
obsolete these days.

> I agree we don't want to optimize for low memory, but I don't think we
> should throw it under the bus, either. Right now we are effectively
> saying the CPU-cache problems with the heap start exceeding the larger
> run size benefits at 64kb (the smallest allowed setting for work_mem).
> While any number we pick is going to be a guess that won't apply to
> all hardware, surely we can come up with a guess better than 64kb.
> Like, 8 MB, say. If available memory for the sort is 8MB or smaller
> and the predicted size anticipates a multipass merge, then we can use
> the heap method rather than the quicksort method. Would a rule like
> that complicate things much?

I'm already using replacement selection for the first run when it is
predicted by my new ad-hoc cost model that we can get away with a
"quicksort with spillover", avoiding almost all I/O. We only
incrementally spill as many tuples as needed right now, but it would
be pretty easy to not quicksort the remaining tuples, but continue to
incrementally spill everything. So no, it wouldn't be too hard to hang
on to the old behavior sometimes, if it looked worthwhile.
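
To illustrate the spillover idea itself, here is a toy on an integer
array (a sorted array plus a linear shift stands in for the heap, and
a second array stands in for the tape -- this is not the patch's code
or data structures, just the shape of the trick):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WORK_MEM 8              /* pretend work_mem holds 8 "tuples" */

static int
cmp_int(const void *a, const void *b)
{
    int     x = *(const int *) a;
    int     y = *(const int *) b;

    return (x > y) - (x < y);
}

int
main(void)
{
    int     input[] = {42, 7, 99, 3, 18, 64, 5, 77, 21, 11};
    int     ninput = (int) (sizeof(input) / sizeof(input[0]));
    int     mem[WORK_MEM];
    int     spill[16];          /* stands in for the tape (plenty for this toy) */
    int     nmem = 0;
    int     nspill = 0;
    int     next = 0;

    /* read input until "memory" fills up */
    while (next < ninput && nmem < WORK_MEM)
        mem[nmem++] = input[next++];

    /*
     * Memory is full but a little input remains.  Spill the smallest
     * in-memory values, in ascending order, one per incoming tuple --
     * just enough to make room.  (That stays in order here because only
     * a couple of tuples remain; in the real patch the heap's run
     * invariant keeps the tape sorted regardless.)
     */
    qsort(mem, nmem, sizeof(int), cmp_int);
    while (next < ninput)
    {
        spill[nspill++] = mem[0];           /* "write to tape" */
        memmove(mem, mem + 1, (nmem - 1) * sizeof(int));
        mem[nmem - 1] = input[next++];      /* reuse the freed slot */
    }

    /* quicksort only what stayed in memory */
    qsort(mem, nmem, sizeof(int), cmp_int);

    /* final on-the-fly merge of the sorted spill with sorted memory */
    for (int a = 0, b = 0; a < nspill || b < nmem;)
    {
        if (b == nmem || (a < nspill && spill[a] <= mem[b]))
            printf("%d ", spill[a++]);
        else
            printf("%d ", mem[b++]);
    }
    printf("\n");
    return 0;
}

Out of ten inputs, only two ever hit the "tape"; the rest are
quicksorted in memory and merged with the short on-disk run as they
are returned.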

In principle, I have no problem with doing that. Through testing, I
cannot see any actual upside, though. Perhaps I just missed something.
Even 8MB is enough to avoid a multi-pass merge for a surprisingly
large volume of input (my work laptop is elsewhere, so I
don't have my notes on this in front of me, but I figured out the
crossover point for a couple of cases).

>> In theory, the answer could be "yes", but it seems highly unlikely.
>> Not only is very little memory required to avoid a multi-pass merge
>> step, but as described above the amount required grows very slowly
>> relative to linear growth in input. I propose to add a
>> checkpoint_warning style warning (with a checkpoint_warning style GUC
>> to control it).
>
> I'm skeptical about a warning for this.

Other systems expose this explicitly and, as I said, state without
qualification that a multi-pass merge should be avoided. Maybe the
warning isn't the right way of communicating that message to the DBA
in detail, but I am confident that it ought to be communicated to the
DBA fairly clearly.

> One idea would be to stop and write out a just-sorted partition
> whenever that partition is contiguous to the already-written portion.
> If the qsort is tweaked to recurse preferentially into the left
> partition first, this would result in tuples being written out at a
> pretty steady pace. If the qsort was unbalanced and the left partition
> was always the larger of the two, then that approach would have to be
> abandoned at some point. But I think there are already defenses
> against that, and at worst you would give up and revert to the
> sort-them-all then write-them-all behavior.

Seems kind of invasive.
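
To make sure I'm picturing the same thing you are, I think the shape
of it is something like this toy (plain integers, with a flush to
stdout standing in for writing to a tape; nothing like this is in the
patch):

#include <stdio.h>

static int  written = 0;        /* prefix [0, written) already "on tape" */

static void
flush_prefix(int *a, int upto)
{
    /* stand-in for incrementally dumping the sorted prefix */
    while (written < upto)
        printf("flush %d\n", a[written++]);
}

static void
qsort_spill(int *a, int lo, int hi)
{
    int     pivot;
    int     i;

    if (lo >= hi)
    {
        /* a finished single-element partition contiguous with the prefix */
        if (lo == hi && lo == written)
            flush_prefix(a, hi + 1);
        return;
    }

    /* Lomuto partition around a[hi] */
    pivot = a[hi];
    i = lo;
    for (int j = lo; j < hi; j++)
    {
        if (a[j] < pivot)
        {
            int     tmp = a[i];

            a[i] = a[j];
            a[j] = tmp;
            i++;
        }
    }
    {
        int     tmp = a[i];

        a[i] = a[hi];
        a[hi] = tmp;
    }

    /* recurse into the left partition first, so the prefix grows steadily */
    qsort_spill(a, lo, i - 1);
    if (i == written)
        flush_prefix(a, i + 1);     /* pivot is now contiguous with it */
    qsort_spill(a, i + 1, hi);
}

int
main(void)
{
    int     a[] = {9, 1, 8, 2, 7, 3, 6, 4, 5, 0};

    qsort_spill(a, 0, (int) (sizeof(a) / sizeof(a[0])) - 1);
    return 0;
}

With left-first recursion the flushed prefix does grow at a fairly
steady pace here, at the cost of giving up recursing into the smaller
partition first.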

> Overall this is very nice. Doing some real world index builds of
> short text (~20 bytes ascii) identifiers, I could easily get speed ups
> of 40% with your patch if I followed the philosophy of "give it as
> much maintenance_work_mem as I can afford". If I fine-tuned the
> maintenance_work_mem so that it was optimal for each sort method, then
> the speedup was quite a bit less, only 22%. But 22% is still very
> worthwhile, and who wants to spend their time fine-tuning the memory
> use for every index build?

Thanks, but I expected better than that. Was it a collated text
column? The C collation will put the patch in a much better light
(more strcoll() calls are needed with this new approach -- it's still
well worth it, but it is a downside that makes collated text not
especially sympathetic). Just sorting on an integer attribute is also
a good sympathetic case, FWIW.

How much time did the sort take in each case? How many runs? How much
time was spent merging? trace_sort output is very interesting here.

--
Peter Geoghegan
