Re: [GSoC 2018] Proposal Draft

From: Kefan Yang <starordust(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [GSoC 2018] Proposal Draft
Date: 2018-03-18 02:00:21
Message-ID: CAF448YJhr0q5N45wVfars29G+h7VY9qBVnXFE1eWdp18X+xQww@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for your quick feedback!

"""
Industrial implementation of selected sorting algorithm:
The industrial version is basically an optimization based on the benchmark
implementation. I plan to use optimizations like checking if input
array is already sorted
or applying insertion sort directly for short arrays to see if the
performance can be
improved. I am still looking for other common optimization methods.

"""
What I am trying to say here is that similar optimizations can be applied
to novel algorithms or other implementations of quicksort.

The paper about "Dual-Pivot Quicksort" is really helpful and it has a Java
implementation already. I can definitely make use of that.

Also, I am wondering what's the normal approach to testing if a certain
sorting implementation brings performance gain in this project? More
specifically, You mentioned that little progress has been made with novel
algorithmic enhancement. How can we get that conclusion?

I am still working on my proposal and I will post a new version soon:)

Regards,
Kefan

2018-03-17 18:05 GMT-07:00 Peter Geoghegan <pg(at)bowt(dot)ie>:

> On Sat, Mar 17, 2018 at 5:34 PM, Kefan Yang <starordust(at)gmail(dot)com> wrote:
> > I am Kefan Yang, a third-year Computing Science student from Simon Fraser
> > University, Canada. I am very interested in the sorting algorithm
> > benchmarking and implementation issue you mentioned on the idealist of
> > Google Summer of Code 2018.
>
> > I've attached a proposal draft to this email. It's a brief one but I
> guess
> > it's enough to show my current ideas:)
>
> The proposal reads:
>
> """
> Industrial implementation of selected sorting algorithm:
> The industrial version is basically an optimization based on the benchmark
> implementation. I plan to use optimizations like checking if input
> array is already sorted
> or applying insertion sort directly for short arrays to see if the
> performance can be
> improved. I am still looking for other common optimization methods.
>
> """
>
> But our qsort implementation, which is the Bentley & McIlroy
> implementation with a couple of customizations, already does
> everything you mention (I refer to the precheck algorithmic
> customization, and the way that we check to see which side of a pivot
> to use real recursion with to avoid stack overflows). I suggest that
> you read the paper [1] -- the code that we use is almost directly
> lifted from the paper. The opaque sounding variable names are the
> same, and the C code from the paper is structured in exactly the same
> way.
>
> I think that this won't be a particularly easy project to get
> committed. I suggest that if you go forward with it that you
> specifically look into integrating "Dual-Pivot Quicksort" [2] as a
> whole cloth replacement for the B&M implementation. It seems like this
> has some chance of working out, because it's more or less acknowledged
> by Bentley himself to be a kind of successor to his own industrial
> quicksort implementation [3] -- it's derived from it. Note that Java 7
> uses dual pivot quicksort when sorting integers.
>
> In general, we've had a lot of success with optimizing sorting in the
> past few years by focusing on things like avoiding cache misses in
> comparators. There has been much less success with algorithmic
> improvements, and no success at all with novel algorithmic
> enhancements. PostgreSQL development just isn't a great way to do that
> sort of thing.
>
> BTW, I noticed that you go on to say this:
>
> """
> However,
> since disk operation is much expensive than in-memory sorting, I am
> not sure if we can
> observe a significant difference in this way.
>
> """
>
> I think that you'd be surprised at how often this isn't true these
> days, at least when sorting enough data for spilling to disk to be
> relevant. The main reason for that is that the cost of writing out
> runs increases linearly, and therefore eventually becomes very small
> compared to the costs of sorting itself, which increases at a
> linearithmic rate.
>
> [1] http://cs.fit.edu/~pkc/classes/writing/samples/
> bentley93engineering.pdf
> [2] http://codeblab.com/wp-content/uploads/2009/09/DualPivotQuicksort.pdf
> [3] http://mail.openjdk.java.net/pipermail/core-libs-dev/2009-
> September/002630.html
> --
> Peter Geoghegan
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-03-18 02:04:44 Re: Precision loss casting float to numeric
Previous Message Tom Lane 2018-03-18 01:37:22 Re: Precision loss casting float to numeric