Re: Feature suggestion : FAST CLUSTER

From: PFC <lists(at)peufeu(dot)com>
To: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Feature suggestion : FAST CLUSTER
Date: 2007-05-27 17:34:30
Message-ID: op.tszxrsircigqcu@apollo13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Sun, 27 May 2007 17:53:38 +0200, Jim C. Nasby <decibel(at)decibel(dot)org>
wrote:

> On Tue, May 22, 2007 at 09:29:00AM +0200, PFC wrote:
>> This does not run a complete sort on the table. It would be about as
>> fast as your seq scan disk throughput. Obviously, the end result is
>> not as
>> good as a real CLUSTER since the table will be made up of several
>> ordered
>> chunks and a range lookup. Therefore, a range lookup on the clustered
>> columns would need at most N seeks, versus 1 for a really clustered
>> table.
>> But it only scans the table once and writes it once, even counting index
>> rebuild.
>
> Do you have any data that indicates such an arrangement would be
> substantially better than less-clustered data?

While the little benchmark that will answer your question is running,
I'll add a few comments :

I have been creating a new benchmark for PostgreSQL and MySQL, that I
will call the Forum Benchmark. It mimics the activity of a forum.
So far, I have got interesting results about Postgres and InnoDB and will
publish an extensive report with lots of nasty stuff in it, in, say, 2
weeks, since I'm doing this in spare time.

Anyway, forums like clustered tables, specifically clusteriing posts on
(topic_id, post_id), in order to be able to display a page with one disk
seek, instead of one seek per post.
PostgreSQL humiliates InnoDB on CPU-bound workloads (about 2x faster
since I run it on dual core ; InnoDB uses only one core). However, InnoDB
can automatically cluster tables without maintenance. This means InnoDB
will, even though it sucks and is awfully bloated, run a lot faster than
postgres if things become IO-bound, ie. if the dataset is larger than RAM.
Postgres needs to cluster the posts table in order to keep going. CLUSTER
is very slow. I tried inserting into a new posts table, ordering by
(post_id, topic_id), then renaming the new table in place of the old. It
is faster, but still slow when handling lots of data.
I am trying other approaches, some quite hack-ish, and will report my
findings.

Regards

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Alexander Staubo 2007-05-27 18:27:31 Re: Feature suggestion : FAST CLUSTER
Previous Message Stefan Kaltenbrunner 2007-05-27 16:34:22 Re: Domains versus Check Constraints