Re: [PERFORM] Releasing memory during External sorting?

From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Dann Corbit" <DCorbit(at)connx(dot)com>, "Ron Peacetree" <rjpeace(at)earthlink(dot)net>, "Mark Lewis" <mark(dot)lewis(at)mir3(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: [PERFORM] Releasing memory during External sorting?
Date: 2005-09-23 21:38:49
Message-ID: D425483C2C5C9F49B5B7A41F8944154757D103@postal.corporate.connx.com
Lists: pgsql-hackers

The cited book also explains how to use a callback function to perform
arbitrary radix sorts (you simply need a method that returns the
[bucketsize] most significant bits for a given data type, for the length
of the key).

So you can sort fairly arbitrary data in linear time (of course if the
key is long then O(n*log(n)) will be better anyway.)
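
For illustration only, a minimal MSD radix sort driven by such a callback
might look like the sketch below (this is not the book's sample code; the
get_bits_u32 callback, BITS_PER_PASS constant, and msd_radix_sort name are
made up for the example, and it assumes fixed-width 32-bit keys):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define BITS_PER_PASS 8                     /* "bucketsize" bits per pass */
    #define NBUCKETS      (1u << BITS_PER_PASS) /* 256 buckets                */

    /* Callback: return the BITS_PER_PASS most significant bits of the key,
     * starting 'shift' bits down from the top.  One such function per type. */
    typedef unsigned (*get_bits_fn)(const void *elem, unsigned shift);

    /* Example callback for 32-bit unsigned keys. */
    static unsigned get_bits_u32(const void *elem, unsigned shift)
    {
        uint32_t key = *(const uint32_t *) elem;
        return (unsigned) (key >> (32 - BITS_PER_PASS - shift)) & (NBUCKETS - 1);
    }

    /* MSD radix sort: distribute into buckets on the current digit, then
     * recurse into each bucket on the next digit. */
    static void msd_radix_sort(uint32_t *a, size_t n, unsigned shift,
                               unsigned key_bits, get_bits_fn get_bits)
    {
        size_t    count[NBUCKETS] = {0};
        size_t    start[NBUCKETS];
        uint32_t *tmp;
        size_t    i, pos;

        if (n < 2 || shift >= key_bits)
            return;

        for (i = 0; i < n; i++)                 /* count keys per bucket      */
            count[get_bits(&a[i], shift)]++;

        for (pos = 0, i = 0; i < NBUCKETS; i++) /* prefix sums -> offsets     */
        {
            start[i] = pos;
            pos += count[i];
        }

        tmp = malloc(n * sizeof *tmp);          /* distribute via scratch copy */
        if (tmp == NULL)
            return;                             /* out of memory: give up     */
        for (i = 0; i < n; i++)
            tmp[start[get_bits(&a[i], shift)]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
        free(tmp);

        for (pos = 0, i = 0; i < NBUCKETS; i++) /* recurse on the next digit  */
        {
            msd_radix_sort(a + pos, count[i], shift + BITS_PER_PASS,
                           key_bits, get_bits);
            pos += count[i];
        }
    }

Called as msd_radix_sort(a, n, 0, 32, get_bits_u32), this makes a bounded
number of passes for a fixed-width key, which is where the linear-time
claim comes from.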

But in any case, if we are talking about external sorting, then disk
time will be so totally dominant that the choice of algorithm is
practically irrelevant.

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-
> owner(at)postgresql(dot)org] On Behalf Of Dann Corbit
> Sent: Friday, September 23, 2005 2:21 PM
> To: Ron Peacetree; Mark Lewis; Tom Lane; pgsql-hackers(at)postgresql(dot)org;
> pgsql-performance(at)postgresql(dot)org
> Subject: Re: [HACKERS] [PERFORM] Releasing memory during External sorting?
>
> For the subfiles, load the top element of each subfile into a priority
> queue. Extract the min element and write it to disk. If the next value
> is the same, then the queue does not need to be adjusted. If the next
> value in the subfile changes, then adjust it.
>
> Then, when the lowest element in the priority queue changes, adjust the
> queue.
>
> Keep doing that until the queue is empty.
>
> You can create all the subfiles in one pass over the data.
>
> You can read all the subfiles, merge them, and write them out in a
> second pass (no matter how many of them there are).
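
For illustration, a bare-bones version of that merge for fixed-width int
runs might look as follows (the HeapItem struct and merge_runs name are
illustrative only, not anyone's production code, and it simply re-sifts
the queue after every extraction instead of using the "next value is the
same" shortcut described above):

    #include <stdio.h>
    #include <stdlib.h>

    /* One entry per subfile (run): the current front value and its source. */
    typedef struct
    {
        int   value;                      /* front element of this run       */
        FILE *src;                        /* the subfile it came from        */
    } HeapItem;

    /* Restore the min-heap property below index i (keyed on 'value'). */
    static void sift_down(HeapItem *h, size_t n, size_t i)
    {
        for (;;)
        {
            size_t l = 2 * i + 1, r = l + 1, m = i;
            HeapItem t;

            if (l < n && h[l].value < h[m].value) m = l;
            if (r < n && h[r].value < h[m].value) m = r;
            if (m == i) return;
            t = h[i]; h[i] = h[m]; h[m] = t;
            i = m;
        }
    }

    /* Merge 'nruns' sorted runs of ints into 'out' in a single pass. */
    static void merge_runs(FILE **runs, size_t nruns, FILE *out)
    {
        HeapItem *heap = malloc(nruns * sizeof *heap);
        size_t    n = 0, i;

        if (heap == NULL)
            return;

        /* Load the top element of each subfile into the priority queue. */
        for (i = 0; i < nruns; i++)
            if (fread(&heap[n].value, sizeof(int), 1, runs[i]) == 1)
                heap[n++].src = runs[i];
        for (i = n; i-- > 0; )
            sift_down(heap, n, i);        /* heapify                         */

        while (n > 0)
        {
            /* Extract the min element and write it to the output file. */
            fwrite(&heap[0].value, sizeof(int), 1, out);

            /* Refill from the same run; if it is exhausted, shrink the heap. */
            if (fread(&heap[0].value, sizeof(int), 1, heap[0].src) != 1)
                heap[0] = heap[--n];
            sift_down(heap, n, 0);        /* adjust the queue                */
        }
        free(heap);
    }

The queue never holds more than one element per subfile, so memory use is
proportional to the number of runs rather than to the amount of data.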
>
> Replacement selection is not a good idea any more, since obviously
> better ideas should take over. Longer runs are of no value if you do
> not have to do multiple merge passes.
>
> I have explained this general technique in the book "C Unleashed",
> chapter 13.
>
> Sample code is available on the book's home page.
>
> > -----Original Message-----
> > From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-
> > owner(at)postgresql(dot)org] On Behalf Of Ron Peacetree
> > Sent: Friday, September 23, 2005 11:41 AM
> > To: Mark Lewis; Tom Lane; pgsql-hackers(at)postgresql(dot)org; pgsql-
> > performance(at)postgresql(dot)org
> > Subject: Re: [HACKERS] [PERFORM] Releasing memory during External sorting?
> >
> > Yep. Also, bear in mind that the lg(n!) ~= n*lg(n) - n lower bound on
> > the number of comparisons:
> > a= says nothing about the amount of data movement used.
> > b= only holds for generic comparison based sorting algorithms.
> >
> > As Knuth says (vol 3, p180), Distribution Counting sorts without
> > ever comparing elements to each other at all, and so does Radix
> > Sort. Similar comments can be found in many algorithms texts.
> >
> > Any time we know that the range of the data to be sorted is
> > substantially restricted compared to the number of items to be sorted,
> > we can sort in less than O(lg(n!)) time. DB fields tend to take on few
> > values and are therefore "substantially restricted".
> >
> > Given the proper resources and algorithms, O(n) sorts are very
> > plausible when sorting DB records.
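
As a concrete illustration of that restricted-range case (a sketch only,
not code from Knuth or any cited source), distribution counting sorts n
keys known to lie in 0..range-1 in O(n + range) time without ever
comparing two keys:

    #include <stdlib.h>
    #include <string.h>

    /* Distribution counting sort: 'a' holds n keys, each < range. */
    static int counting_sort(unsigned *a, size_t n, unsigned range)
    {
        size_t   *count = calloc(range, sizeof *count);
        unsigned *out   = malloc(n * sizeof *out);
        size_t    i, pos;

        if (count == NULL || out == NULL)
        {
            free(count);
            free(out);
            return -1;                    /* out of memory                   */
        }

        for (i = 0; i < n; i++)           /* tally each key value            */
            count[a[i]]++;

        for (pos = 0, i = 0; i < range; i++)  /* prefix sums -> offsets      */
        {
            size_t c = count[i];
            count[i] = pos;
            pos += c;
        }

        for (i = 0; i < n; i++)           /* place each key at its offset    */
            out[count[a[i]]++] = a[i];

        memcpy(a, out, n * sizeof *a);
        free(out);
        free(count);
        return 0;
    }

When a column takes on only a handful of distinct values, range stays tiny
relative to n, which is exactly the "substantially restricted" situation
described above.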
> >
> > All of the fastest external sorts of the last decade or so take
> > advantage of this. Check out that URL I posted.
> >
> > Ron
> >
> >
> > -----Original Message-----
> > From: Mark Lewis <mark(dot)lewis(at)mir3(dot)com>
> > Sent: Sep 23, 2005 1:43 PM
> > To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> > Subject: Re: [PERFORM] Releasing memory during External sorting?
> >
> > operations != passes. If you were clever, you could probably write a
> > modified bubble-sort algorithm that only made 2 passes. A pass is a
> > disk scan; operations are then performed (hopefully in memory) on what
> > you read from the disk. So there's no theoretical log N lower bound on
> > the number of disk passes.
> >
> > Not that I have anything else useful to add to this discussion, just a
> > tidbit I remembered from my CS classes back in college :)
> >
> > -- Mark
> >
> > On Fri, 2005-09-23 at 13:17 -0400, Tom Lane wrote:
> > > Ron Peacetree <rjpeace(at)earthlink(dot)net> writes:
> > > > 2= No optimal external sorting algorithm should use more than 2
> > > > passes.
> > > > 3= Optimal external sorting algorithms should use 1 pass if at all
> > > > possible.
> > >
> > > A comparison-based sort must use at least N log N operations, so it
> > > would appear to me that if you haven't got approximately log N passes
> > > then your algorithm doesn't work.
> > >
> > > regards, tom lane
> >
