new index type with clustering in mind.

From: "Jack Douglas" <jack(at)douglastechnology(dot)co(dot)uk>
To: <pgsql-general(at)postgresql(dot)org>
Subject: new index type with clustering in mind.
Date: 2014-05-24 16:58:37
Message-ID: 002f01cf7771$68407be0$38c173a0$@douglastechnology.co.uk
Lists: pgsql-general

Hi

Please forgive this question if it is naïve.

Would the following be practical to implement:

A btree-like index type that points to *pages* rather than to individual
rows. That is, if many rows in a page share the same data in the indexed
columns, only one index entry exists for them. In its normal use case, this
index would be much smaller than a regular index on the same columns, which
would contain one entry per row.

To reduce complexity (e.g. MVCC/snapshot-related issues), index entries would
be added when a row is inserted, but they would not be removed when the row
is updated/deleted (or when an insert is rolled back): this would cause
index bloat over time in volatile tables but this would be acceptable for
the use case I have in mind. So in essence, an entry in the index would
indicate that there *may* be matching rows in the page, not that there
actually are.
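To illustrate the idea (in Python rather than C, with hypothetical names, not an actual PostgreSQL access method), the index can be modelled as a map from an indexed key to the set of heap page numbers that may contain matching rows. Entries are only ever added, so a lookup yields candidate pages that must be rechecked:

```python
class PageIndex:
    """Lossy, insert-only sketch of the proposed index: maps an indexed
    key to the set of heap page numbers that *may* hold matching rows."""

    def __init__(self):
        self.entries = {}  # key -> set of page numbers

    def on_insert(self, key, page):
        # Called when a row is inserted; at most one entry per (key, page),
        # so many rows with the same key on one page share a single entry.
        self.entries.setdefault(key, set()).add(page)

    # Deliberately no on_delete/on_update: entries are never removed, so
    # the index bloats on volatile tables but stays simple w.r.t. MVCC.

    def candidate_pages(self, key):
        # Pages that may contain matching rows; each must be rechecked,
        # since an entry only means rows *may* still be there.
        return self.entries.get(key, set())
```

Note that repeated inserts of the same key into the same page leave the index unchanged, which is what keeps it small when the data is well clustered.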

For data where there is a high degree of clustering on the columns of the
index, this index would be quite efficient at locating matching rows: the
index itself would be small and most pages would contain largely rows that
match. (In practice it would behave similarly to a ‘cluster’ index in
Oracle, without the guarantee of clustering.)

This brings me on to my use case: a method to efficiently keep a table
largely clustered (with ‘clustered’ more loosely defined than the usual
Postgres definition) without locking more than a page at a time. The above
index could be scanned periodically for entries where multiple keys point to
the same page. Of that set of keys/pages, the subset where the same key
appears in multiple pages can be clustered by simply deleting and
reinserting the rows, key by key. In this case, the pages affected by the
deletes could be locked briefly in order to safely delete the corresponding
index entries without conflicting with other transactions. If a lock can’t
be obtained, the process could simply skip and move on to the next candidate
key/pages.
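The periodic sweep described above might look roughly like the following sketch (again in Python, with a hypothetical `table` abstraction providing `try_lock_pages`, `delete_rows`, `insert_rows`, and `unlock_pages`; `index.entries` is a key-to-page-set map):

```python
def clustering_sweep(index, table):
    """For each key whose index entries span multiple pages, delete and
    reinsert its rows so they end up clustered together.  Keys whose
    pages cannot be locked are simply skipped until the next sweep."""
    for key, pages in list(index.entries.items()):
        if len(pages) < 2:
            continue  # only one candidate page: already 'clustered'
        if not table.try_lock_pages(pages):
            continue  # can't lock briefly; skip this candidate key
        try:
            rows = table.delete_rows(key, pages)  # pull matching rows out
            index.entries[key] = set()            # drop the stale entries
            new_page = table.insert_rows(rows)    # reinsert contiguously
            index.entries[key].add(new_page)
        finally:
            table.unlock_pages(pages)
```

With the pages locked, the corresponding index entries can be deleted safely along with the rows, which is the one case where the index shrinks rather than bloats.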

Although this kind of ‘clustered’ data is less ordered than a normal
Postgres cluster, it retains some of the useful properties: in particular,
the data for a given key is likely to be found in a small number of pages
rather than scattered evenly through the entire table, which would require
potentially far more I/O to select.

Many thanks in advance and my apologies once again if this kind of thing has
been rejected before or is obviously impractical.

Jack
