Re: No merge sort?

From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Ron Peacetree" <rjpeace(at)earthlink(dot)net>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: No merge sort?
Date: 2003-04-15 02:47:51
Message-ID: D90A5A6C612A39408103E6ECDD77B829408ACF@voyager.corporate.connx.com
Lists: pgsql-hackers

> -----Original Message-----
> From: Ron Peacetree [mailto:rjpeace(at)earthlink(dot)net]
> Sent: Friday, April 11, 2003 5:02 PM
> To: pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] No merge sort?
>
>
> ""Dann Corbit"" <DCorbit(at)connx(dot)com> wrote in message
> news:D90A5A6C612A39408103E6ECDD77B829408ACB(at)voyager(dot)corporate(dot)connx(dot)co
> m...
> > > From: Ron Peacetree [mailto:rjpeace(at)earthlink(dot)net]
> > > Sent: Wednesday, April 09, 2003 12:38 PM
> > > You only need as many bins as you have unique key values, silly
> > > ;-) Remember, at its core Radix sort is just a distribution
> > > counting sort (that's the name I learned for the general
> > > technique). The simplest implementation uses bits as the atomic
> > > unit, but there's nothing saying you have to... In a DB, we know
> > > all the values of the fields we currently have stored in the DB.
> > > We can take advantage of that.
> >
> > By what magic do we know this? If a database knew a priori what
> > all the distinct values were, it would indeed be excellent magic.
> For any table already in the DB, this is self-evident. If you are
> talking about sorting data =before= it gets put into the DB (say
> for a load), then things are different, and the best you can do is
> to use comparison-based methods (and probably some reasonably
> sophisticated external merging routines in addition, if the data
> set to be sorted and loaded is big enough). The original question
> as I understood it was about a sort as part of a query. That means
> everything to be sorted is in the DB, and we can take advantage of
> what we know.

The database does not know all the distinct values that it contains.
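
To make the claim concrete, here is a minimal sketch (my
illustration, not code from this thread) of the distribution
counting sort being proposed, under the assumption that the k
distinct key values are already available in sorted order, e.g.
harvested from an index:

#include <stdlib.h>
#include <string.h>

/* Distribution (counting) sort over string keys, assuming the k
 * distinct values are already known and sorted (e.g. pulled from
 * an index).  Stable; O(n log k) with this binary-search mapping. */
static int key_bin(const char *key, char *const *distinct, int k)
{
    int lo = 0, hi = k - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int c = strcmp(key, distinct[mid]);
        if (c == 0)
            return mid;
        if (c < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;  /* unreachable if every key really is in the list */
}

void distribution_sort(char **rows, int n, char *const *distinct, int k)
{
    int *count = calloc(k + 1, sizeof *count);
    char **out = malloc(n * sizeof *out);

    for (int i = 0; i < n; i++)          /* tally each bin */
        count[key_bin(rows[i], distinct, k) + 1]++;
    for (int b = 1; b <= k; b++)         /* prefix sums = bin offsets */
        count[b] += count[b - 1];
    for (int i = 0; i < n; i++)          /* stable scatter pass */
        out[count[key_bin(rows[i], distinct, k)]++] = rows[i];

    memcpy(rows, out, n * sizeof *rows);
    free(out);
    free(count);
}

Note what the assumption buys: the scratch arrays and the binary
search per element are exactly the "extra space" and "more
complicated inner loop" conceded further down. Without the distinct
list in hand beforehand, none of this gets off the ground, and that
list is the point in dispute.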

> > With 80 bytes, you have 2^320 possible values. There is no way
> > around that. If you are going to count them or use them as a
> > radix, you will have to classify them. The only way you will know
> > how many unique values you have in "Company+Division" is to
> > either sort them or by some means discover all that are distinct.
> Ummm, Indexes, particularly Primary Key Indexes, anyone?
> Finding the unique values in an index should be perceived as
> trivial... ...and you often have the index in memory for
> other reasons already.

If we have a unique clustered primary key index, why bother to sort?
The ORDER BY will be faster with no sort step at all, since the index
already delivers the rows in order.

> > The database does not know how many there are beforehand. Indeed,
> > there could be anywhere from zero to 2^320 (given enough space)
> > distinct values.
> >
> > I would be interested to see your algorithm that performs a
> > counting or radix sort on 320 bit bins and that works in the
> > general case without using extra space.
> >
> 1= See below. I clearly stated that we need to use extra space.
> 2= If your definition of "general" is "we know nothing about the
> data", then of course any method based on using more sophisticated
> ordering operators than comparisons is severely hampered, if not
> doomed. This is not a comparison based method. You have to know
> some things about the data being sorted before you can use it.
> I've been =very= clear about that throughout this.
>
>
> > > Note TANSTAAFL (as Heinlein would've said): We have to use
> > > extra space for mapping the radix values to the radix keys, and
> > > our "inner loop" for the sorting operation is considerably more
> > > complicated than that for a quicksort or a mergesort. Hence the
> > > fact that even though this is an O(n) technique, in real world
> > > terms you can expect only a 2-3X performance improvement over
> > > say quicksort for realistic amounts of data.
> > >
> > > Oh and FTR, IME for most "interesting" sorts in the DB domain,
> > > even for "internal" sorting techniques, the time to read data
> > > from disk and (possibly) write results back to disk tends to
> > > dominate the time spent actually doing the internal sort...
> Since you don't believe me, and have implied you wouldn't
> believe me even if I posted results of efforts on my part, go
> sort some 2GB files by implementing the algorithms in the
> sources I've given and as mentioned in this thread. (I'm
> assuming that you have access to "real" machines that can
> perform this as an internal sort, or we wouldn't be bothering
> to have this discussion). Then come back to the table if you
> still think there are open issues to be discussed.

I don't believe you, and I am not going to try to code a sort that
does not work [and is demonstrably impossible]. But I will tell you
what I will do: I will post the binary for a Win32 sorting routine I
wrote that will sort files of any size. If you can write a tool that
sorts faster (and does not rely on a primary key, which would render
the task of sorting moot), then I will believe you.
Here is the routine:
ftp://cap.connx.com/chess-engines/new-approach/External.exe

The syntax to sort a text file with it is:
External <input_file> <output_file> <internal_sorting_buffer_size>

So (for instance) to sort a 100 gigabyte file with 1 gig of physical
memory allowed for the sort, the syntax would be:
External 100gig.dat 100gig.sor 1000000000
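
For the curious: the general shape of such a tool is the textbook
external merge sort: form sorted runs that fit in the buffer, then
k-way merge the runs. A simplified, line-oriented sketch of that
scheme follows (an illustration with arbitrary fixed limits, not the
actual External.exe source):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_RUNS 128   /* arbitrary sketch limits; a real tool   */
#define MAX_LINE 1024  /* would size these dynamically           */

static int cmp_lines(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

int external_sort(const char *in, const char *out, size_t bufbytes)
{
    FILE *fin = fopen(in, "r");
    FILE *runs[MAX_RUNS];
    int nruns = 0;
    char line[MAX_LINE];

    if (fin == NULL)
        return -1;

    /* Phase 1: read buffer-sized chunks, sort each, spill as a run. */
    for (;;) {
        char **buf = NULL;
        size_t n = 0, cap = 0, used = 0;

        while (used < bufbytes && fgets(line, sizeof line, fin)) {
            if (n == cap) {
                cap = cap ? 2 * cap : 1024;
                buf = realloc(buf, cap * sizeof *buf);
            }
            buf[n] = strdup(line);
            used += strlen(line);
            n++;
        }
        if (n == 0) {
            free(buf);
            break;
        }
        qsort(buf, n, sizeof *buf, cmp_lines);
        runs[nruns] = tmpfile();
        for (size_t i = 0; i < n; i++) {
            fputs(buf[i], runs[nruns]);
            free(buf[i]);
        }
        free(buf);
        rewind(runs[nruns]);
        nruns++;
    }
    fclose(fin);

    /* Phase 2: k-way merge, always emitting the smallest head line.
     * Linear scan over run heads; a heap wins once nruns is large. */
    FILE *fout = fopen(out, "w");
    static char head[MAX_RUNS][MAX_LINE];
    int live[MAX_RUNS];

    for (int r = 0; r < nruns; r++)
        live[r] = fgets(head[r], MAX_LINE, runs[r]) != NULL;
    for (;;) {
        int best = -1;
        for (int r = 0; r < nruns; r++)
            if (live[r] && (best < 0 || strcmp(head[r], head[best]) < 0))
                best = r;
        if (best < 0)
            break;
        fputs(head[best], fout);
        live[best] = fgets(head[best], MAX_LINE, runs[best]) != NULL;
    }
    for (int r = 0; r < nruns; r++)
        fclose(runs[r]);
    fclose(fout);
    return 0;
}

With a 1 gig buffer, a 100 gigabyte input yields on the order of 100
runs, so the merge pass touches every byte once more; that is why
disk I/O, not the in-memory comparison count, dominates, as was
already conceded earlier in this thread.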
