Re: No merge sort?

From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Ron Peacetree" <rjpeace(at)earthlink(dot)net>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: No merge sort?
Date: 2003-04-09 17:54:39
Message-ID: D90A5A6C612A39408103E6ECDD77B829408AC6@voyager.corporate.connx.com
Lists: pgsql-hackers

> -----Original Message-----
> From: Ron Peacetree [mailto:rjpeace(at)earthlink(dot)net]
> Sent: Tuesday, April 08, 2003 3:07 PM
> To: pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] No merge sort?
>
>
> "Dann Corbit" <DCorbit(at)connx(dot)com> wrote in message
> news:D90A5A6C612A39408103E6ECDD77B829408AC4(at)voyager(dot)corporate(dot)connx(dot)com...
> > There can be a problem here with Distribution & Radix sorts.
> > Imagine the following query:
> >
> > SELECT company, division, plant, sum(revenue) FROM corporate
> > GROUP BY company, division, plant
> > ORDER BY company, division, plant
> >
> > Let's suppose further that company, division, and plant are all
> > char 20. For a given section of company + division, the first 40
> > characters will be the same. So a radix or distribution sort will
> > need 40 passes over the identical data before it even gets to the
> > differentiating field plant.
> >
> >
> 1= Distribution sort does not have this problem. Any data
> that can be constrained by the "pigeon hole principle" can be
> sorted in o(n) time.

Distribution sort is not an algorithm. It is a general technique. Both
counting sort and radix sort are distribution sorts. I think you are
talking about counting sort here. In order to insert a technique into a
database, you must solve the general case.
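To make the distinction concrete: counting sort (the "pigeon hole" case) really is O(n + k), but only because the key range k is known and small. A minimal illustrative sketch (Python pseudocode, not PostgreSQL source; the function name is mine):

```python
def counting_sort(keys, k):
    """Sort integers in the range [0, k). O(n + k) time, O(k) extra space.

    The O(n) claim holds only while k stays small; for 20-char text
    fields k explodes, which is the general case a database must handle.
    """
    counts = [0] * k
    for key in keys:              # one pass: tally each pigeonhole
        counts[key] += 1
    out = []
    for value in range(k):        # emit each value as often as it occurred
        out.extend([value] * counts[value])
    return out
```

The bin array of size k is exactly the cost that blows up once keys are no longer small bounded integers.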

> 2= For Radix sort, that's iff ("if and only if") you use =characters=
> as the radix of a radix sort and do an MSB aka partition-exchange
> sort. The appropriate radix here is a =field= not a character. Since
> there are 3 fields vs 60 characters the problem becomes 2/3 wasted
> passes instead of 40/60.

This is lunacy. If you count the field or use it as a radix, you will
need a radix of 40*8 bits = 320 bits. That means you will have 2^320 =
2.136e96 bins.
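That is the whole trade-off in radix sorting: a practical implementation keeps the radix small (say, one byte, 256 bins) and pays one stable counting pass per byte of key instead. A hedged sketch of byte-wise LSD radix sort (illustrative Python; names are mine):

```python
def lsd_radix_sort(keys, width):
    """LSD radix sort over fixed-width byte strings.

    One stable pass per byte position: 40-byte keys cost 40 passes,
    but each pass needs only 256 bins -- not the 256**40 (~2^320) bins
    that treating a whole 40-char field as the radix would require.
    """
    for pos in reversed(range(width)):       # least significant byte first
        buckets = [[] for _ in range(256)]
        for key in keys:
            buckets[key[pos]].append(key)    # indexing bytes yields an int
        keys = [key for bucket in buckets for key in bucket]
    return keys
```

Stability of each pass is what lets the later (more significant) passes preserve the order established by the earlier ones.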

> We can do even better by making one field for
> company+division and one for plant. Now we have no wasted passes and
> an O(n) sort. Since PostgreSQL has wonderful built in
> support for string handling and regexp, this is clearly an
> easy to implement and superior approach.

You have no idea what you are talking about.

> Since a comparison based sorting algorithm would be best
> implemented by working on the same atomic units, the fair
> comparison becomes the
> log(n!) behavior of that sort vs the O(n) behavior of
> Distribution or Radix sort on data of that atomic size. The
> constants involved in each ordering operation are bigger for
> the two O(n) sorts, but despite that you'll find that for
> reasonable sizes of real world data, you'll complete the O(n)
> sorts 2-3X faster.

That implies a space requirement of roughly 2.136e96 bins * 4 bytes per
counter (assuming that 4 billion is the largest row count allowed in
the database, so each bin's count fits in 4 bytes). If we could build a
memory that large, how long would it take to reach the last bin? There
are an estimated 10^80 elementary particles in the universe.

> Of course, internal sorting performance is vastly dominated
> by disk I/O costs if there is even a moderate amount disk I/O
> involved. However, there are ways to combine decent internal
> sorting with efficient disk merging, getting the best
> possible out of both subsystems.

You are confusing terms here. An internal sort is performed 100% in
memory; it is an external sort that uses disk space.
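The usual way the two combine is: the internal sort produces memory-sized sorted runs, and the external phase does a k-way merge of those runs from disk. A minimal sketch of that structure (illustrative Python; in-memory lists stand in for temp files, and the function name is mine):

```python
import heapq

def external_sort(records, memory_limit):
    """Sketch of an external sort in two phases.

    Phase 1 (internal): sort each memory-sized chunk into a run.
    Phase 2 (external): k-way merge the sorted runs; on a real system
    the runs live in temp files and this phase is disk-I/O bound.
    """
    runs = []
    for start in range(0, len(records), memory_limit):
        chunk = sorted(records[start:start + memory_limit])  # internal sort
        runs.append(chunk)                                   # "spill" a run
    return list(heapq.merge(*runs))                          # external merge
```

Because the merge reads each run sequentially, the disk access pattern stays mostly sequential, which is why a good internal sort and a good external merge can be tuned independently.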

> > If they are fixed width, we could use LSD radix sort and count all
> > the bins in one pass. But that technique will fail with varchar.
> > Naturally, most long character fields will be varchar. The basic
> > upshot is that radix and distribution sorts work best with short
> > keys.
> >
> More accurate would be to say that such sorts work best with keys
> that have as little in common as possible with each other. Length
> isn't the problem; =the number of unique keys= and having few equal
> keys is. Large quantities of equal keys can trash the performance of
> many sorting algorithms, including quicksort, unless steps are taken
> to either modify the algorithm to handle it or the data to minimize
> such an occurrence.

I would guess that you have never done any sorting work.
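For what it's worth, the equal-keys point is well known: classic two-way quicksort degrades on low-cardinality keys, and the standard remedy is a three-way (Dutch national flag) partition. A sketch of that fix (illustrative Python, not anyone's production code):

```python
def quicksort3(a, lo=0, hi=None):
    """In-place quicksort with three-way partitioning.

    Keys equal to the pivot are gathered in the middle and never
    recursed into, so long runs of duplicates cost O(n) total instead
    of driving a two-way partition toward O(n^2).
    """
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return a
    pivot = a[(lo + hi) // 2]
    lt, i, gt = lo, lo, hi          # invariant: a[lo:lt] < pivot,
    while i <= gt:                  # a[lt:i] == pivot, a[gt+1:hi+1] > pivot
        if a[i] < pivot:
            a[lt], a[i] = a[i], a[lt]
            lt += 1
            i += 1
        elif a[i] > pivot:
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
        else:
            i += 1
    quicksort3(a, lo, lt - 1)       # recurse only into the strict < and >
    quicksort3(a, gt + 1, hi)       # partitions; the == band is done
    return a
```

An all-duplicates input finishes in a single partitioning pass, which is exactly the pathological case for the naive variant.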
