Re: new index type with clustering in mind.

From: "Jack Douglas" <jack(at)douglastechnology(dot)co(dot)uk>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Martijn van Oosterhout'" <kleptog(at)svana(dot)org>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: new index type with clustering in mind.
Date: 2014-06-03 07:24:58
Message-ID: 001b01cf7efc$ecfc9bb0$c6f5d310$@douglastechnology.co.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

> > > To reduce complexity (eg MVCC/snapshot related issues), index entries
> > > would be added when a row is inserted, but they would not be removed
> > > when the row is updated/deleted (or when an insert is rolled back):

> > It's an interesting idea, but, how can you *ever* delete index entries?
> > I.e. is there a way to maintain the index without rebuilding it
> > regularly?
>
> The discussions at PGCon pointed out that with the posting-list
compression
> logic added in 9.4, GIN indexes are pretty close to this already.
Multiple
> items on the same heap page will typically only take one byte of index
space
> per item; but there is an identifiable entry, so you don't get into these
> questions of when VACUUM should remove entries, and it's not lossy so
> you're not forced to pay the overhead of rechecking every entry on the
> linked-to page.

The README file in the source code for GIN indexes says: "Note: There is no
delete operation in the key (entry) tree. The reason for this is that in our
experience, the set of distinct words in a large corpus changes very slowly.
This greatly simplifies the code and concurrency algorithms.". Does that
mean that for the case where GIN is used as a simple replacement for btree,
my initial suggestion above ("...would be added when a row is inserted, but
they would not be removed...") is effectively what happens already with GIN
indexes?

---

I've written up a test to demonstrate the principle of this 'clustering
lite': http://dba.stackexchange.com/q/66292/1396

In brief, it shows the benefit of this sort of clustering (much lower io
versus unclustered, and no exclusive lock or heavy WAL generation versus
current clustering implementations) with a workable way of achieving it in
the current release. The basic idea is:

1. turn off autovacuum for the table
2. check each block to determine the degree of clustering
3. delete and re-insert all the rows from blocks below a clustering
threshold
4. manually vacuum to free those (complete) blocks
5. repeat steps 2-4 as regularly as necessary

A weakness is that is requires a full table scan is required. All that is
needed to avoid the full-scan would be a way of getting `blkno` from an
index-only scan, which as far as I can tell is not currently possible, is
it?

In response to

Browse pgsql-general by date

  From Date Subject
Next Message sramay 2014-06-03 08:25:36 bytea Issue - Reg
Previous Message Kapil Agarwal 2014-06-03 01:34:29 Re: Archive# views