Re: Equivalent praxis to CLUSTERED INDEX?

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: "J(dot) Andrew Rogers" <jrogers(at)neopolitan(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org, mischa(dot)sandberg(at)telus(dot)net
Subject: Re: Equivalent praxis to CLUSTERED INDEX?
Date: 2004-08-26 18:18:53
Message-ID: 200408261818.i7QIIrr09640@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


How do vendors actually implement auto-clustering? I assume they move
rows around during quiet periods or have lots of empty space in each
value bucket.

---------------------------------------------------------------------------

J. Andrew Rogers wrote:
> On Tue, 2004-08-24 at 22:28, Mischa Sandberg wrote:
> > I see that PG has a one-shot CLUSTER command, but doesn't support
> > continuously-updated clustered indexes.
> >
> > What I infer from newsgroup browsing is, such an index is impossible,
> > given the MVCC versioning of records (happy to learn I'm wrong).
>
>
> It is possible to have MVCC and ordered/indexed heaps, but it isn't
> something you can just tack onto the currently supported types -- I
> looked into this myself. It would take substantial additional code
> infrastructure to support it, basically an alternative heap system and
> adding support for tables with odd properties to many parts of the
> system. Pretty non-trivial.
>
> This is probably my #1 "I wish postgres had this feature" feature. It
> is a serious scalability enhancer for big systems and a pain to work
> around not having.
>
>
> > I'd be curious to know what other people, who've crossed this same
> > bridge from MSSQL or Oracle or Sybase to PG, have devised,
> > faced with the same kind of desired performance gain for retrieving
> > blocks of rows with the same partial key.
>
>
> The CLUSTER command is often virtually useless for precisely the kinds
> of tables that need to be clustered. My databases are on-line 24x7, and
> the tables that are ideal candidates for clustering are in the range of
> 50-100 million rows. I can afford to lock these tables up for no more
> than 5-10 minutes during off-peak in the hopes that no one notices, and
> CLUSTER does not work remotely in the ballpark of that fast for tables
> of that size. People who can run CLUSTER in a cron job must either have
> relatively small tables or regular large maintenance windows.
>
>
> My solution, which may or may not work for you, was to write a table
> partitioning system using the natural flexibility and programmability of
> postgresql (e.g. table inheritance). From this I automatically get a
> roughly ordered heap according to the index I would cluster on, with
> only slightly funky SQL access. The end result works much better with
> CLUSTER too, though CLUSTER is much less necessary at that point
> because, at least for my particular purposes, the rows are mostly
> ordered due to how the data was partitioned.
>
> So there are ways to work around CLUSTER, but you'll have to be clever
> and it will require tailoring the solution to your particular
> requirements.
>
>
> J. Andrew Rogers
>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Josh Berkus 2004-08-26 18:20:20 Re: Disabling transaction/outdated-tuple behaviour
Previous Message Matt Clark 2004-08-26 18:18:52 Re: Disabling transaction/outdated-tuple behaviour