Re: [PERFORM] Releasing memory during External sorting?

From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Ron Peacetree" <rjpeace(at)earthlink(dot)net>, "Mark Lewis" <mark(dot)lewis(at)mir3(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: [PERFORM] Releasing memory during External sorting?
Date: 2005-09-23 21:20:39
Message-ID: D425483C2C5C9F49B5B7A41F8944154757D102@postal.corporate.connx.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

For the subfiles, load the top element of each subfile into a priority
queue. Extract the min element and write it to disk. If the next value
is the same, then the queue does not need to be adjusted. If the next
value in the subfile changes, then adjust it.

Then, when the lowest element in the priority queue changes, adjust the
queue.

Keep doing that until the queue is empty.

You can create all the subfiles in one pass over the data.

You can read all the subfiles, merge them, and write them out in a
second pass (no matter how many of them there are).

Replacement selection is not a good idea any more, since obvious better
ideas should take over. Longer runs are of no value if you do not have
to do multiple merge passes.

I have explained this general technique in the book "C Unleashed",
chapter 13.

Sample code is available on the book's home page.

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-
> owner(at)postgresql(dot)org] On Behalf Of Ron Peacetree
> Sent: Friday, September 23, 2005 11:41 AM
> To: Mark Lewis; Tom Lane; pgsql-hackers(at)postgresql(dot)org; pgsql-
> performance(at)postgresql(dot)org
> Subject: Re: [HACKERS] [PERFORM] Releasing memory during External
sorting?
>
> Yep. Also, bear in mind that the lg(n!)= ~ nlgn - n lower bound on
> the number of comparisions:
> a= says nothing about the amount of data movement used.
> b= only holds for generic comparison based sorting algorithms.
>
> As Knuth says (vol 3, p180), Distribution Counting sorts without
> ever comparing elements to each other at all, and so does Radix
> Sort. Similar comments can be found in many algorithms texts.
>
> Any time we know that the range of the data to be sorted is
substantially
> restricted compared to the number of items to be sorted, we can sort
in
> less than O(lg(n!)) time. DB fields tend to take on few values and
are
> therefore "substantially restricted".
>
> Given the proper resources and algorithms, O(n) sorts are very
plausible
> when sorting DB records.
>
> All of the fastest external sorts of the last decade or so take
advantage
> of
> this. Check out that URL I posted.
>
> Ron
>
>
> -----Original Message-----
> From: Mark Lewis <mark(dot)lewis(at)mir3(dot)com>
> Sent: Sep 23, 2005 1:43 PM
> To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> Subject: Re: [PERFORM] Releasing memory during External sorting?
>
> operations != passes. If you were clever, you could probably write a
> modified bubble-sort algorithm that only made 2 passes. A pass is a
> disk scan, operations are then performed (hopefully in memory) on what
> you read from the disk. So there's no theoretical log N lower-bound
on
> the number of disk passes.
>
> Not that I have anything else useful to add to this discussion, just a
> tidbit I remembered from my CS classes back in college :)
>
> -- Mark
>
> On Fri, 2005-09-23 at 13:17 -0400, Tom Lane wrote:
> > Ron Peacetree <rjpeace(at)earthlink(dot)net> writes:
> > > 2= No optimal external sorting algorithm should use more than 2
> passes.
> > > 3= Optimal external sorting algorithms should use 1 pass if at all
> possible.
> >
> > A comparison-based sort must use at least N log N operations, so it
> > would appear to me that if you haven't got approximately log N
passes
> > then your algorithm doesn't work.
> >
> > regards, tom lane
>
> ---------------------------(end of
broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo(at)postgresql(dot)org so that
your
> message can get through to the mailing list cleanly

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2005-09-23 21:31:37 Re: Patching dblink.c to avoid warning about open transaction
Previous Message Bruce Momjian 2005-09-23 20:47:38 Re: pg_dump fails to set index ownership