
Re: Feature suggestion : FAST CLUSTER

From: PFC <lists(at)peufeu(dot)com>
To: "Jim Nasby" <decibel(at)decibel(dot)org>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Feature suggestion : FAST CLUSTER
Date: 2007-05-29 08:43:56
Message-ID: op.ts2yjilecigqcu@apollo13
Lists: pgsql-performance
> I assume you meant topic_id, post_id. :)

	Um, yes ;)

> The problem with your proposal is that it does nothing to ensure that  
> posts for a topic stay together as soon as the table is large enough  
> that you can't sort it in a single pass. If you've got a long-running  
> thread, it's still going to get spread out throughout the table.

	I completely agree with you.
	However, you have to consider the use cases for clustered tables.

	Suppose you want to cluster on (A,B). A can be topic, category, month,
store, whatever; B can be post_id, product_id, etc.

	Now consider these cases:

1- Fully clustered table and defragmented file, TOAST table also in the  
same order as the main table.
CLUSTER does not reorder the TOAST table, but [INSERT INTO new_table SELECT *
FROM old_table ORDER BY a, b] also fills the TOAST table in the right order
(a sketch follows the list).

2- Totally unclustered

3- InnoDB, which is an index-organized table, i.e. a BTree with the data in
the leaves; this means clustering on the primary key is automatic

4- My partial cluster proposal, i.e. the table and its TOAST table have been
clustered in chunks of, say, 500 MB.
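
For case 1 done by hand, a minimal sketch (all table, column, and index
names are hypothetical):

	CREATE TABLE new_table AS
	    SELECT * FROM old_table ORDER BY a, b;
	-- the heap and, as a side effect, the TOAST table are both written
	-- in (a,b) order; then recreate the indexes, drop old_table, rename.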

	* You always want to get ALL records with a specific value of A.

In this case, a fully clustered table will obviously be the best choice:
1 seek, then seq scan; Bitmap Index Scan rules.
You might think that the InnoDB case would perform the same.
However, after some time the table and its files on disk will be very
fragmented, and btree pages with the same value of A will be everywhere,
as they have been split and merged by insertions and deletions, so your
theoretical sequential scan might well translate into 1 seek per page.
Obviously, since all the rows in the page will be of interest to you, this
is still better than 1 seek per row, but you get the idea.

	* You want records with a specific value of A, and B inside a range

	Examples:
	- topic_id = X AND post_id between start_of_page and end_of_page
	- get the sales records for January grouped by store
	- get the first 10 comments of a blog post
	- etc.
	I would bet that this use case happens much more often than the previous  
one.
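
	For instance, the forum case boils down to a query of this shape
(hypothetical posts table, with a composite index matching the cluster
order):

	CREATE INDEX posts_topic_post ON posts (topic_id, post_id);

	SELECT * FROM posts
	WHERE topic_id = 1234
	  AND post_id BETWEEN 100 AND 150;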

In this case, a fully clustered table will obviously, again, be the best  
choice.
A randomly "organized" table will cost 1 seek per row, i.e. buy more RAM or
get fired.
InnoDB will not do so badly: the number of seeks will be (number of rows
wanted) / (rows per page).
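
	To put rough, assumed numbers on that formula (200 matching rows, about
50 rows per page):

	SELECT 200      AS random_heap_seeks,  -- 1 seek per row
	       200 / 50 AS innodb_seeks;       -- 4 seeks, one per leaf page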

	However, how would the chunked clustered case work?

	In the worst case, if the table has been sorted in N "chunks", you'll  
need N seeks.
	However, since people generally cluster their stuff on some time-related
column (posts and comments are sorted by insertion time), you can safely
bet that most of the rows you want will end up in the same chunk, or in
maybe 2-3 chunks, reducing the number of seeks to something between 1 and
the number of chunks.
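
	For instance, with the assumed figures of a 3 GB table sorted in 500 MB
chunks:

	SELECT ceil(3072.0 / 500) AS chunks;
	-- 7 chunks, so at most 7 seeks in the worst case, and typically
	-- 1-3 when the wanted rows share a few recent chunks, versus
	-- thousands of seeks on a randomly organized table.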

	The fact is, sometimes people don't use CLUSTER when it would really help  
because it takes too long.
	
	Suppose you have a 3GB table, and your work_mem is 512MB.

	CLUSTER will take forever.
	A hypothetical new implementation of CLUSTER that did an on-disk sort
would create several sort bins, then merge them.
	A chunked cluster like the one I'm proposing would be several times
faster, since it would operate at roughly the raw disk IO speed (Postgres
sorting is so fast when it fits in RAM...).
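
	Here is a rough sketch of what such a chunked rewrite amounts to,
expressed as plain SQL with hypothetical names, assuming consecutive
post_id ranges map to roughly 500 MB of heap:

	CREATE TABLE posts_new (LIKE posts INCLUDING DEFAULTS);

	-- One INSERT per chunk; each sort fits in work_mem, so it runs
	-- at memory speed instead of spilling to an on-disk merge.
	INSERT INTO posts_new
	    SELECT * FROM posts
	    WHERE post_id < 1000000
	    ORDER BY topic_id, post_id;

	INSERT INTO posts_new
	    SELECT * FROM posts
	    WHERE post_id >= 1000000 AND post_id < 2000000
	    ORDER BY topic_id, post_id;
	-- ... and so on for the remaining chunks.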

	So, having both a full and a chunked CLUSTER would allow users to run the
full cluster maybe once a month, and the chunked cluster maybe once every
few days.
	And the chunked CLUSTER would find most of the rows already in order, so
its end result would be very close to a full CLUSTER, at a fraction of the
runtime.

	An added bonus is for index rebuilds.
	Instead of first clustering the table, then rebuilding the indexes, this
could be done:

	- initialize sort buffers for the index builds
	- loop :
		- grab 500 MB of data from the table
		- sort it
		- insert it into new table
		- while data is still in RAM, extract the indexed columns and shove them  
into each index's sort buffer
		- repeat until all data is processed

	- now you have all the indexed columns ready to be used, with no need to
rescan the table! Index rebuild will be much faster.
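
	Until something like that exists, the closest approximation from plain
SQL is to build the indexes only after the ordered reload, so each index
is bulk-built once instead of being maintained row by row during the load
(names again hypothetical):

	INSERT INTO new_table SELECT * FROM old_table ORDER BY a, b;
	CREATE INDEX new_table_a_b ON new_table (a, b);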

	I see VACUUM FULL is scheduled for reimplementation in a future version.
	This could be the way: with the same code path, it could do VACUUM
FULL, CLUSTER, and chunked CLUSTER, just by changing if/how the sort step
is done, giving the user a nice performance / maintenance-time tradeoff.
	Getting this into, say, 8.5 is better than CLUSTER CONCURRENTLY in 8.10,
I would dare say.
	And the original table can still be read from during the operation.

	Benchmarks are running. I will back this with figures in a few days.

	Another thing to do for CLUSTER would be to cluster only the tail of the
table.

	Have a nice day!


>
> What you really want is CLUSTER CONCURRENTLY, which I believe is on the  
> TODO list. BUT... there's another caveat here: for any post where the  
> row ends up being larger than 2k, the text is going to get TOASTed  
> anyway, which means it's going to be in a separate table, in a different  
> ordering. I don't know of a good way to address that; you can cluster  
> the toast table, but you'll be clustering on an OID, which isn't going  
> to help you.
> --
> Jim Nasby                                            jim(at)nasby(dot)net
> EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


