Re: Using quicksort for every external sort run

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Greg S <stark(at)mit(dot)edu>
Subject: Re: Using quicksort for every external sort run
Date: 2016-02-04 11:14:02
Message-ID: CAM3SWZQO7nY_UthWEH8JOEu3rSwRVyJOW3K+D9yjLXhvLq9TmA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 4, 2016 at 1:46 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> It wasn't my original insight that replacement selection has become
> all but obsolete. It took me a while to come around to that point of
> view.

Nyberg et al. may have said it best in 1994, in the AlphaSort paper [1]:

"By comparison, OpenVMS sort uses a pure replacement-selection sort to
generate runs (Knuth, 1973). Replacement-selection is best for a
memory-constrained environment. On average, replacement-selection
generates runs that are twice as large as available memory, while the
QuickSort runs are typically less than half of available memory.
However, in a memory-rich environment, QuickSort is faster because it
is simpler, makes fewer exchanges on average, and has superior address
locality to exploit processor caching."

(I believe that the authors state that "QuickSort runs are typically
less than half of available memory" because of the use of explicit
asynchronous I/O in each thread, which doesn't apply to us).
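
In case it helps to see the trick in miniature, here is a rough,
self-contained sketch of heap-based replacement selection run
generation (nothing like the real tuplesort.c code; Entry, HEAPSIZE
and the toy input are all invented for illustration):

/*
 * Toy replacement selection, in the style of Knuth's algorithm
 * 5.4.1R.  Heap entries are tagged with a run number, and the heap
 * orders on (run, key), so a value smaller than the last one written
 * is pushed into the *next* run and sinks below all current-run
 * entries.
 */
#include <stdio.h>

#define HEAPSIZE 4              /* toy stand-in for work_mem */

typedef struct { int run; int key; } Entry;

static Entry heap[HEAPSIZE];
static int   nheap;

static int
entry_cmp(const Entry *a, const Entry *b)
{
    if (a->run != b->run)
        return (a->run < b->run) ? -1 : 1;
    return (a->key > b->key) - (a->key < b->key);
}

static void
sift_down(int i)
{
    for (;;)
    {
        int     l = 2 * i + 1,
                r = l + 1,
                min = i;
        Entry   tmp;

        if (l < nheap && entry_cmp(&heap[l], &heap[min]) < 0)
            min = l;
        if (r < nheap && entry_cmp(&heap[r], &heap[min]) < 0)
            min = r;
        if (min == i)
            break;
        tmp = heap[i];
        heap[i] = heap[min];
        heap[min] = tmp;
        i = min;
    }
}

int
main(void)
{
    int     input[] = {5, 3, 9, 1, 8, 2, 7, 4, 6, 0};
    int     ninput = sizeof(input) / sizeof(input[0]);
    int     next = 0;
    int     i;

    /* fill "memory" from the input, then heapify */
    while (nheap < HEAPSIZE && next < ninput)
        heap[nheap++] = (Entry) { 0, input[next++] };
    for (i = nheap / 2 - 1; i >= 0; i--)
        sift_down(i);

    while (nheap > 0)
    {
        Entry   top = heap[0];

        printf("run %d: %d\n", top.run, top.key);

        if (next < ninput)
        {
            int     key = input[next++];

            /* smaller than the value just emitted? defer to next run */
            heap[0] = (Entry) { key >= top.key ? top.run : top.run + 1,
                                key };
        }
        else
            heap[0] = heap[--nheap];
        sift_down(0);
    }
    return 0;
}

With this toy input the first run picks up 6 of the 10 values even
though only 4 fit in "memory" at once, which is exactly the doubling
effect the paper describes. Presorted input would come out as a
single run, and reverse-sorted input as runs of exactly HEAPSIZE.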

The paper also has a very good analysis of the economics of sorting:

"Even for surprisingly large sorts, it is economical to perform the
sort in one pass."

Of course, memory capacities have scaled enormously in the 20 years
since that analysis was performed, so it now holds even at the very
low end. The high-capacity memory system that they advocate to get a
one-pass sort (instead of faster disks) had 100MB of memory, which is
of course tiny by contemporary standards. If
you pay Heroku $7 a month, you get a "Hobby Tier" database with 512MB
of memory. The smallest EC2 instance size, the t2.nano, costs about
$1.10 to run for one week, and has 0.5GB of memory.

The economics of using 4MB or even 20MB to sort 10GB of data are
already preposterously bad for everyone who runs a database server,
no matter how budget-conscious they may be. I can reluctantly accept
that we still need to use a heap with very low work_mem settings to
avoid the risk of a regression (in the event of a strong correlation)
on general principle, but I'm well justified in proposing "just don't
do that" as the best practical advice.

I thought I had your agreement on that point, Robert; is that actually the case?

[1] http://www.cs.berkeley.edu/~rxin/db-papers/alphasort.pdf

--
Peter Geoghegan
