Re: Minor performance improvement in transition to external sort

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Jeremy Harris <jgh(at)wizmail(dot)org>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Minor performance improvement in transition to external sort
Date: 2014-02-06 18:21:16
Message-ID: CAMkU=1wc7_mtc_2QvGxoLZEGBkd=D0dnwNKZ4C+ifSKu9e+imQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 4, 2014 at 2:22 PM, Jeremy Harris <jgh(at)wizmail(dot)org> wrote:

> The attached patch replaces the existing siftup method for heapify with
> a siftdown method. Tested with random integers it does 18% fewer
> compares and takes 10% less time for the heapify, over the work_mem
> range 1024 to 1048576.
>

Thanks for working on this. Several times I've wondered why we didn't use
siftdown for heapifying, as it is a well known optimization. How big of
sets were you sorting in each case? Was it always just slightly bigger
than would fit in work_mem? Did you try sorting already-sorted, reverse
sorted, or pipe-organ shaped data sets? We will also need to test it on
strings. I usually use md5(random()::text) to generate strings for such
purposes, at least for a first pass.

The name of the attached patch is a bit unfortunate. And I'm not sure what
you are doing with tupindex, the change there doesn't seem to be necessary
to the rest of your work, so some explanation on that would be helpful.

The project coding style usually has comments in blocks before loops on
which they comment, rather than interspersed throughout the clauses of the
"for" statement.

>
> Both algorithms appear to be O(n) (contradicting Wikipedia's claim
> that a siftup heapify is O(n log n)).

Siftdown is linear in all cases. Siftup may be linear in the typical case,
but is n log n in the worst case.

Another optimization that is possible is to do only one comparison at each
level (between the two children) all the way down the heap, and then
compare the parent to the chain of smallest children in reverse order.
This can substantially reduce the number of comparisons on average,
because most tuples sift most of the way down the heap (because most of the
tuples are at the bottom of the heap). Robert submitted a patch to do this
in the main siftdown routine (which for some reason is
named tuplesort_heap_siftup, rather than siftdown), and I found it a
substantial improvement but it never got pushed forward. I think Robert
was concerned it might make some obscure cases worse rather than better,
and no one put it through enough serious testing to overcome that concern.

(Also, I think you should make your siftdown code look more like the
existing siftdown code.)

Cheers,

Jeff

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David E. Wheeler 2014-02-06 18:29:31 Re: extension_control_path
Previous Message Sawada Masahiko 2014-02-06 18:20:53 'dml' value for log_statement