Skip site navigation (1) Skip section navigation (2)

Re: [PERFORM] A Better External Sort?

From: PFC <lists(at)boutiquenumerique(dot)com>
To: rjpeace(at)earthlink(dot)net
Cc: "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>,pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-09-29 16:10:29
Message-ID: op.sxvgjrceth1vuj@localhost (view raw, whole thread or download thread mbox)
Lists: pgsql-hackerspgsql-performance
	Just to add a little anarchy in your nice debate...

	Who really needs all the results of a sort on your terabyte table ?

	I guess not many people do a SELECT from such a table and want all the  
results. So, this leaves :
	- Really wanting all the results, to fetch using a cursor,
	- CLUSTER type things, where you really want everything in order,
	- Aggregates (Sort->GroupAggregate), which might really need to sort the  
whole table.
	- Complex queries where the whole dataset needs to be examined, in order  
to return a few values
	- Joins (again, the whole table is probably not going to be selected)
	- And the ones I forgot.


	Most likely you only want to SELECT N rows, in some ordering :
	- the first N (ORDER BY x LIMIT N)
	- and other variants

	Or, you are doing a Merge JOIN against some other table ; in that case,  
yes, you might need the whole sorted terabyte table, but most likely there  
are WHERE clauses in the query that restrict the set, and thus, maybe we  
can get some conditions or limit values on the column to sort.

	Also the new, optimized hash join, which is more memory efficient, might  
cover this case.

	Point is, sometimes, you only need part of the results of your sort. And  
the bigger the sort, the most likely it becomes that you only want part of  
the results. So, while we're in the fun hand-waving, new algorithm trying  
mode, why not consider this right from the start ? (I know I'm totally in  
hand-waving mode right now, so slap me if needed).

	I'd say your new, fancy sort algorithm needs a few more input values :

	- Range of values that must appear in the final result of the sort :
		none, minimum, maximum, both, or even a set of values from the other  
side of the join, hashed, or sorted.
	- LIMIT information (first N, last N, none)
	- Enhanced Limit information (first/last N values of the second column to  
sort, for each value of the first column) (the infamous "top10 by  
category" query)
	- etc.

	With this, the amount of data that needs to be kept in memory is  
dramatically reduced, from the whole table (even using your compressed  
keys, that's big) to something more manageable which will be closer to the  
size of the final result set which will be returned to the client, and  
avoid a lot of effort.

	So, this would not be useful in all cases, but when it applies, it would  
be really useful.

	Regards !


In response to

pgsql-performance by date

Next:From: PFCDate: 2005-09-29 16:12:52
Subject: Re: Comparative performance
Previous:From: Michael Ben-NesDate: 2005-09-29 14:25:49
Subject: Re: Advice on RAID card

pgsql-hackers by date

Next:From: Tom LaneDate: 2005-09-29 16:22:59
Subject: Re: pgbench: undefined reference to strndup()
Previous:From: Tino WildenhainDate: 2005-09-29 15:33:28
Subject: Re: postgresql clustering

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group