Re: Timsort performance, quicksort (was: Re: Memory usage during sorting)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>, Greg S <stark(at)mit(dot)edu>
Subject: Re: Timsort performance, quicksort (was: Re: Memory usage during sorting)
Date: 2012-04-24 15:17:09
Message-ID: CA+TgmobKFcb6ajVEN8eSnBa78sB3FSwuEWTHXd2x9JC5DOh5OA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Apr 18, 2012 at 9:31 PM, Peter Geoghegan <peter(at)2ndquadrant(dot)com> wrote:
> Thoughts?

Interesting work. I thought about trying to code up timsort at one
point, but I've been running short of round tuits.

I did some quick tests of quicksort using half a million random
strings. On my MacBook Pro, if the strings are in random order,
quicksort takes about 12 seconds. If they are presorted, it takes
about 800 ms. If they are in sorted order with an empty string
appended onto the end, it takes about 25 seconds.

If I modify gen_qsort_tuple.pl to perform the "presorted" input check
only when n > 40, the for the
presorted-except-the-last-element-is-small test drops from 25 seconds
to about 21.5 seconds, without apparently harming either of the other
two cases. If I remove the check altogether, the time further drops
to about 13.5 seconds, the time to sort completely-ordered data rises
to about 10-10.5 seconds, and the time to sort randomly ordered data
still doesn't seem to change much.

Based on that, I'm inclined to propose rejiggering things so that the
presorted-input check runs only at the top level, and not during any
recursive steps. The theory is that it won't cost much because if the
data is randomly ordered we won't do many comparisons before falling
out, and that seems to be true. But the only point of doing it in the
recursive steps is to win when the data is partially ordered, and we
actually seem to be losing big-time in that case, perhaps because when
the data is partially ordered, the presort check will frequently to
run through a significant part of the array - wasting cycles - but
fall out before it actually gets to the end.

Of course, even if we did that, it might not be as good as your
timsort numbers, but that doesn't mean we shouldn't do it...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-04-24 16:00:16 Re: New sync commit mode remote_write
Previous Message Tom Lane 2012-04-24 14:11:26 Re: Usage of planner_ctx