Re: cluster index on a table

From: Scott Carey <scott(at)richrelevance(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: cluster index on a table
Date: 2009-07-17 00:02:18
Message-ID: C6850D9A.A49D%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


On 7/16/09 1:49 PM, "Greg Stark" <gsstark(at)mit(dot)edu> wrote:

> On Thu, Jul 16, 2009 at 9:06 PM, Scott Carey<scott(at)richrelevance(dot)com> wrote:
>> Keep the old page around or a copy of it that old transactions reference?
>> Just more Copy on Write.
>> How is that different from a nested loop on an index scan/seek currently?
>> Doesn't an old transaction have to reference an old heap page through an
>> index with the current implementation?  At least, the index references
>> multiple versions and their visibility must be checked.  Can a similar
>> solution work here?  Just reference the pre and post split pages and filter
>> by visibility?
>
> That would be a completely different architecture than we have now.
> We're not Oracle, Postgres does all this at the tuple level, not at
> the page level. We have tuple versions, not page versions, and tuple
> locks, not page locks.

Yes, that is a little more tricky, but I still don't see why that is an
issue. The Copy on Write I was referring to is at the tuple level.

>
> The old transaction has to reference a heap page through an index with
> the current implementation.

And so it could have a reference to an clustered index leaf page with tuples
in it through an index. Essentially, clustered index leaf pages are a
special type of heap-like page.

> But it can do so safely precisely because
> the tuple will be where the index references it as long as necessary.

Can't the same guarantee be made for clustered index data? When one splits,
keep the old one around as long as necessary.

> As long as that old transaction is live it's guaranteed not to be
> removed by vacuum (well... except by VACUUM FULL but that's a whole
> nother story).
>
> Actually this is probably the clearest problem with IOT in the
> Postgres universe. What do other indexes use to reference these rows
> if they can move around?

Indexes would point to a heap page for normal tables and clustered index
pages for clustered tables. When new versions of data come in, it may point
to new clustered index pages, just like they currently get modified to point
to new heap pages with new data. A split just means more data is modified.
And if multiple versions are still 'live' they point to multiple versions --
just as they now have to with the ordinary heap pages. Page splitting only
means that there might be some copies of row versions with the same tid due
to a split, but labeling a page with a 'creation' tid to differentiate the
pre and post split pages removes the ambiguity. I suppose that would
require some page locking.

>
> I wanted to call Heikki's "grouped index item" patch that he worked on
> for so long index organized tables. Basically that's what they are
> except the leaf tuples are stored in the regular heap like always,
> hopefully in index order. And there are no leaf tuples in the index so
> the actual index is much much smaller. It doesn't look like a
> traditional IOT but it behaves a lot like one in the space savings and
> access patterns.

Sounds like its a best-effort to maintain the heap in the index order, using
FILLFACTOR and dead space to hopefully keep it in roughly the right order.
That is an incremental improvement on the current situation.

For clustered indexes, clustered index pages would need to be treated like
the current heap pages and tuples with respect to copy on write. Yes, that
is more than a trivial modification of the current heap or index page types
and their MVCC semantics. Its definitely a third type of page.

Am I missing something?

>
> --
> greg
> http://mit.edu/~gsstark/resume.pdf
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Greg Stark 2009-07-17 00:27:35 Re: cluster index on a table
Previous Message Kevin Grittner 2009-07-16 22:30:17 Re: Very big insert/join performance problem (bacula)