Re: [PERFORM] CLUSTER command

From: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
To: johnnnnnn <john(at)phaedrusdeinus(dot)org>
Cc: <pgsql-performance(at)postgresql(dot)org>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: [PERFORM] CLUSTER command
Date: 2002-12-13 00:03:47
Message-ID: 20021212154146.T13718-100000@megazone23.bigpanda.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-interfaces pgsql-performance

On Thu, 12 Dec 2002, johnnnnnn wrote:

> On Thu, Dec 12, 2002 at 05:39:44PM -0500, Jean-Luc Lachance wrote:
> > Let's create a new command:
> >
> > PARTITION <table> ON <attribute>
> <snip>
> > Because the table file is already extended (2G limit) using
> > different files extension (.N)
> > how complicated (modifying the code) would it be to have the table
> > files split according to the cluster key?
>

> I think the code changes would be complicated. Just at a 30-second
> consideration, this would need to touch:
> - all sql (selects, inserts, updates, deletes)
> - vacuuming
> - indexing
> - statistics gathering
> - existing clustering

I think his idea was to treat it similarly to the way that the
system treats tables >2G with .N files. The only thing is that
I believe the code that deals with that wouldn't be particularly
easy to change to do it though, but I've only taken a cursory look at
what I think is the place that does that(storage/smgr/md.c). Some sort of
good partitioning system would be nice though.

> create table u1 (...);
> create table u2 (...);
> create table u3 (...);
>
> create view uv as (select "A" as partition_key, ... from u1
> union all
> select "B" as partition_key, ... from u2
> union all
> select "C" as partition_key, ... from u3);
>
> That keeps the tables in different files on-disk while still allowing
> you to query against all of them. You need to index them separately
> and logic is necessary when changing data.

Unfortunately, I think that the optimizer isn't going to do what you'd
hope here and scan only the appropriate table if you were to say
partition_key='A' and foo='bar'. I'd love to be shown that I'm wrong, but
the best I could see hoping for would be that if partition_key was part of
u1-u3 and there was an index on partition_key,foo that it could use that
and do minimal work on the other tables.

In addition, doing something like the above is a nightmare if you don't
know beforehand what the partitions should be (for example if you know
there aren't alot of distinct values, but you don't know what they are) or
for that matter even with 10-15 partitions, writing the rules and such
would probably be really error prone.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Alvaro Herrera 2002-12-13 00:47:19 Re: [PERFORM] CLUSTER command
Previous Message Steve Crawford 2002-12-12 23:51:52 Grant issues

Browse pgsql-interfaces by date

  From Date Subject
Next Message Alvaro Herrera 2002-12-13 00:47:19 Re: [PERFORM] CLUSTER command
Previous Message johnnnnnn 2002-12-12 23:00:02 Re: [GENERAL] CLUSTER command

Browse pgsql-performance by date

  From Date Subject
Next Message Alvaro Herrera 2002-12-13 00:47:19 Re: [PERFORM] CLUSTER command
Previous Message johnnnnnn 2002-12-12 23:00:02 Re: [GENERAL] CLUSTER command