Re: Feature suggestion : FAST CLUSTER

From: PFC <lists(at)peufeu(dot)com>
To: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Feature suggestion : FAST CLUSTER
Date: 2007-05-29 06:23:52
Message-ID: op.ts2r12wtcigqcu@apollo13
Lists: pgsql-performance

On Sun, 27 May 2007 19:34:30 +0200, PFC <lists(at)peufeu(dot)com> wrote:

> On Sun, 27 May 2007 17:53:38 +0200, Jim C. Nasby <decibel(at)decibel(dot)org>
> wrote:
>
>> On Tue, May 22, 2007 at 09:29:00AM +0200, PFC wrote:
>>> This does not run a complete sort on the table. It would be about as
>>> fast as your seq scan disk throughput. Obviously, the end result is
>>> not as
>>> good as a real CLUSTER since the table will be made up of several
>>> ordered
>>> chunks and a range lookup. Therefore, a range lookup on the clustered
>>> columns would need at most N seeks, versus 1 for a really clustered
>>> table.
>>> But it only scans the table once and writes it once, even counting
>>> index
>>> rebuild.
>>
>> Do you have any data that indicates such an arrangement would be
>> substantially better than less-clustered data?
>
> While the little benchmark that will answer your question is running,
> I'll add a few comments :

Alright, so far :

This is a simulated forum workload, so it's mostly post insertions, some
edits, and some topic deletes.
It will give results applicable to forums, obviously, but also anything
that works on the same schema :
- topics + posts
- blog articles + comments
- e-commerce site where users can enter their reviews
So, with the new trend being to let users participate, this kind of
workload will become more and more relevant for websites.

So, how to cluster the posts table on (topic_id, post_id) to get all the
posts on the same webpage in 1 seek ?

I am benchmarking the following :
- CLUSTER obviously
- Creating a new table and INSERT .. SELECT ORDER BY topic_id, post_id,
then reindexing etc
- not doing anything (just vacuuming all tables)
- not even vacuuming the posts table.
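For reference, the first two methods above boil down to something like
this (the index and table names are just examples, not my actual schema) :

    -- full CLUSTER on an index over (topic_id, post_id)
    -- (CLUSTER indexname ON tablename is the 8.x syntax)
    CLUSTER posts_topic_post_idx ON posts;

    -- or : rewrite the table in sorted order, then swap it in
    CREATE TABLE posts_sorted (LIKE posts INCLUDING DEFAULTS);
    INSERT INTO posts_sorted SELECT * FROM posts ORDER BY topic_id, post_id;
    -- ... then drop the old table, rename posts_sorted to posts,
    -- recreate the indexes, and ANALYZE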

I am also trying the following more exotic approaches :

* chunked sort :

Well, sorting 1GB of data when your work_mem is only 512 MB needs several
passes, hence a lot of disk IO. The more data, the more IO.
So, instead of doing this, I will :
- grab about 250 MB of posts from the table
- sort them by (topic_id, post_id)
- insert them in a new table
- repeat
- then reindex, etc and replace old table with new.
(reindex is very fast : since the table is nicely defragmented now, I get
full disk speed. However, I would like to be able to create 2 indexes with
ONE table scan !)
I'm trying 2 different ways to do that, with plpgsql and cursors.
It is much faster than sorting the whole data set, because the sorts are
only done in memory (hence the "chunks").
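To give an idea, here is a rough sketch of the chunked copy. This is not
the exact code I'm benchmarking (it chunks on a post_id range instead of
using a cursor, and it assumes post_id is an integer key whose order
roughly follows insertion order, and that posts_new already exists with
the same structure as posts) :

    CREATE OR REPLACE FUNCTION chunked_copy( chunk integer ) RETURNS void AS $$
    DECLARE
        lo integer := 0;
        hi integer;
    BEGIN
        SELECT max(post_id) INTO hi FROM posts;
        WHILE lo <= hi LOOP
            -- each chunk is small enough to be sorted entirely in work_mem
            INSERT INTO posts_new
                SELECT * FROM posts
                WHERE post_id > lo AND post_id <= lo + chunk
                ORDER BY topic_id, post_id;
            lo := lo + chunk;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

    -- SELECT chunked_copy( 1000000 );  -- pick a chunk worth ~250 MB of rows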
So far, it seems a database clustered this way is about as fast as using
CLUSTER, but the clustering operation is faster.
More results in about 3 days when the benchmarks finish.

* other dumb stuff

I'll try DELETEing the last 250MB of records, stuffing them in a temp
table, vacuuming, and re-inserting them in order.
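Something along these lines (the cutoff value is made up, and again this
assumes post_id order roughly matches the physical order of the table) :

    -- move the tail of the table aside
    CREATE TEMP TABLE posts_tail AS
        SELECT * FROM posts WHERE post_id > 9000000;  -- last ~250 MB
    DELETE FROM posts WHERE post_id > 9000000;

    -- reclaim the space (VACUUM cannot run inside a transaction block)
    VACUUM posts;

    -- put the rows back, in clustered order this time
    INSERT INTO posts SELECT * FROM posts_tail ORDER BY topic_id, post_id;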
