Re: [PERFORM] A Better External Sort?

From: PFC <lists(at)boutiquenumerique(dot)com>
To: rjpeace(at)earthlink(dot)net
Cc: "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-09-29 16:10:29
Message-ID: op.sxvgjrceth1vuj@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance


Just to add a little anarchy in your nice debate...

Who really needs all the results of a sort on your terabyte table ?

I guess not many people do a SELECT from such a table and want all the
results. So, this leaves :
- Really wanting all the results, to fetch using a cursor,
- CLUSTER type things, where you really want everything in order,
- Aggregates (Sort->GroupAggregate), which might really need to sort the
whole table.
- Complex queries where the whole dataset needs to be examined, in order
to return a few values
- Joins (again, the whole table is probably not going to be selected)
- And the ones I forgot.

However,

Most likely you only want to SELECT N rows, in some ordering :
- the first N (ORDER BY x LIMIT N)
- last N (ORDER BY x DESC LIMIT N)
- WHERE x>value ORDER BY x LIMIT N
- WHERE x<value ORDER BY x DESC LIMIT N
- and other variants

Or, you are doing a Merge JOIN against some other table ; in that case,
yes, you might need the whole sorted terabyte table, but most likely there
are WHERE clauses in the query that restrict the set, and thus, maybe we
can get some conditions or limit values on the column to sort.

Also the new, optimized hash join, which is more memory efficient, might
cover this case.

Point is, sometimes, you only need part of the results of your sort. And
the bigger the sort, the most likely it becomes that you only want part of
the results. So, while we're in the fun hand-waving, new algorithm trying
mode, why not consider this right from the start ? (I know I'm totally in
hand-waving mode right now, so slap me if needed).

I'd say your new, fancy sort algorithm needs a few more input values :

- Range of values that must appear in the final result of the sort :
none, minimum, maximum, both, or even a set of values from the other
side of the join, hashed, or sorted.
- LIMIT information (first N, last N, none)
- Enhanced Limit information (first/last N values of the second column to
sort, for each value of the first column) (the infamous "top10 by
category" query)
- etc.

With this, the amount of data that needs to be kept in memory is
dramatically reduced, from the whole table (even using your compressed
keys, that's big) to something more manageable which will be closer to the
size of the final result set which will be returned to the client, and
avoid a lot of effort.

So, this would not be useful in all cases, but when it applies, it would
be really useful.

Regards !

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2005-09-29 16:22:59 Re: pgbench: undefined reference to strndup()
Previous Message Tino Wildenhain 2005-09-29 15:33:28 Re: postgresql clustering

Browse pgsql-performance by date

  From Date Subject
Next Message PFC 2005-09-29 16:12:52 Re: Comparative performance
Previous Message Michael Ben-Nes 2005-09-29 14:25:49 Re: Advice on RAID card