Re: [GSoC 2018] Proposal Draft

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Kefan Yang <starordust(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [GSoC 2018] Proposal Draft
Date: 2018-03-18 01:05:09
Message-ID: CAH2-WznNDrRcZqh3vRLn_K1xLUoUaUnipuUf62wJQuB7fEgEVA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Mar 17, 2018 at 5:34 PM, Kefan Yang <starordust(at)gmail(dot)com> wrote:
> I am Kefan Yang, a third-year Computing Science student from Simon Fraser
> University, Canada. I am very interested in the sorting algorithm
> benchmarking and implementation issue you mentioned on the idealist of
> Google Summer of Code 2018.

> I've attached a proposal draft to this email. It's a brief one but I guess
> it's enough to show my current ideas:)

The proposal reads:

"""
Industrial implementation of selected sorting algorithm:
The industrial version is basically an optimization based on the benchmark
implementation. I plan to use optimizations like checking if input
array is already sorted
or applying insertion sort directly for short arrays to see if the
performance can be
improved. I am still looking for other common optimization methods.

"""

But our qsort implementation, which is the Bentley & McIlroy
implementation with a couple of customizations, already does
everything you mention (I refer to the precheck algorithmic
customization, and the way that we check to see which side of a pivot
to use real recursion with to avoid stack overflows). I suggest that
you read the paper [1] -- the code that we use is almost directly
lifted from the paper. The opaque sounding variable names are the
same, and the C code from the paper is structured in exactly the same
way.

I think that this won't be a particularly easy project to get
committed. I suggest that if you go forward with it that you
specifically look into integrating "Dual-Pivot Quicksort" [2] as a
whole cloth replacement for the B&M implementation. It seems like this
has some chance of working out, because it's more or less acknowledged
by Bentley himself to be a kind of successor to his own industrial
quicksort implementation [3] -- it's derived from it. Note that Java 7
uses dual pivot quicksort when sorting integers.

In general, we've had a lot of success with optimizing sorting in the
past few years by focusing on things like avoiding cache misses in
comparators. There has been much less success with algorithmic
improvements, and no success at all with novel algorithmic
enhancements. PostgreSQL development just isn't a great way to do that
sort of thing.

BTW, I noticed that you go on to say this:

"""
However,
since disk operation is much expensive than in-memory sorting, I am
not sure if we can
observe a significant difference in this way.

"""

I think that you'd be surprised at how often this isn't true these
days, at least when sorting enough data for spilling to disk to be
relevant. The main reason for that is that the cost of writing out
runs increases linearly, and therefore eventually becomes very small
compared to the costs of sorting itself, which increases at a
linearithmic rate.

[1] http://cs.fit.edu/~pkc/classes/writing/samples/bentley93engineering.pdf
[2] http://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf
[3] http://mail.openjdk.java.net/pipermail/core-libs-dev/2009-September/002630.html
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-03-18 01:32:36 Re: strange failure in plpgsql_control tests (on fulmar, ICC 14.0.3)
Previous Message Kefan Yang 2018-03-18 00:34:11 [GSoC 2018] Proposal Draft