Re: Memory usage during sorting

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory usage during sorting
Date: 2012-04-13 18:01:16
Message-ID: CAEYLb_UbhYiZGm4z9gXyOEDGWKO_-3VCoMAdk4dERDEvXk0m8A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 13 April 2012 18:33, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> We do use insertion sort for partitions of size < 7.  I assume you are
> referring to something else.

I stand corrected. My confusion came from the removal of a later
*switch* to insertion sort in
a3f0b3d68f9a5357a3f72b40a45bcc714a9e0649, which also added the
pre-sorted check where you'd expect to see the insertion switch. Of
course, the n < 7 insertion sort thing is right beside the added
check. So another, redundant copy of insertion sort was removed, and
not the one that almost every real-world quicksort implementation has.

> Heap-sorting could also benefit from some optimization in this area.
> Right now, if you heap-sort data that's already in order, it gets
> progressively slower as you crank up work_mem, because the heap gets
> deeper and so extraction and siftup get more expensive, for no real
> benefit.  We could do something like check, every time we use up our
> available memory, whether the heap is already entirely in order.  If
> so, we could dump all but one tuple to the current run and the start
> reading tuples again.  Or maybe dump some percentage of the array,
> though that might require a bit more bookkeeping.  Or maybe there's a
> smarter algorithm that would also cater to mostly-sorted data.

Well, timsort is specifically designed to take advantage of pre-sorted
data. It does appear to have a lot of traction, as wikipedia points
out:

"Timsort has been Python's standard sorting algorithm since version
2.3. It is now also used to sort arrays in Java SE 7,[2] and on the
Android platform.[3] "

Somebody has produced a BSD-licensed C implementation that draws
heavily on the Python/Java one here:

http://code.google.com/p/timsort/source/browse/trunk/timSort.c?spec=svn17&r=17

It looks like it has exactly the same interface as a c stdlib qsort.
So it'd be fairly easy to produce a timsort_arg() based on this, if
anyone cares to.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2012-04-13 18:15:10 Re: [COMMITTERS] pgsql: Add new replication mode synchronous_commit = 'write'.
Previous Message Robert Haas 2012-04-13 17:33:43 Re: Memory usage during sorting