Re: Minmax indexes

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Minmax indexes
Date: 2014-06-23 15:16:29
Message-ID: 53A844CD.8030601@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Some comments, aside from the design wrt. bounding boxes etc. :

On 06/15/2014 05:34 AM, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
>> <alvherre(at)2ndquadrant(dot)com> wrote:
>>> Here's an updated version of this patch, with fixes to all the bugs
>>> reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
>>> Amit Kapila for the reports.
>>
>> I'm not very happy with the use of a separate relation fork for
>> storing this data.
>
> Here's a new version of this patch. Now the revmap is not stored in a
> separate fork, but together with all the regular data, as explained
> elsewhere in the thread.

Thanks! Please update the README accordingly.

If I understand the code correctly, the revmap is a three-level deep
structure. The bottom level consists of "regular revmap pages", and each
regular revmap page is filled with ItemPointerDatas, which point to the
index tuples. The middle level consists of "array revmap pages", and
each array revmap page contains an array of BlockNumbers of the "regular
revmap" pages. The top level is an array of BlockNumbers of the array
revmap pages, and it is stored in the metapage.

With 8k block size, that's just enough to cover the full range of 2^32-1
blocks that you'll need if you set mm_pages_per_range=1. Each regular
revmap page can store about 8192/6 = 1365 item pointers, each array
revmap page can store about 8192/4 = 2048 block references, and the size
of the top array is 8192/4. That's just enough; to store the required
number of array pages in the top array, the array needs to be
(2^32/1365)/2048)=1536 elements large.

But with 4k or smaller blocks, it's not enough.

I wonder if it would be simpler to just always store the revmap pages in
the beginning of the index, before any other pages. Finding the revmap
page would then be just as easy as with a separate fork. When the
table/index is extended so that a new revmap page is needed, move the
existing page at that block out of the way. Locking needs some
consideration, but I think it would be feasible and simpler than you
have now.

> I have followed the suggestion by Amit to overwrite the index tuple when
> a new heap tuple is inserted, instead of creating a separate index
> tuple. This saves a lot of index bloat. This required a new entry
> point in bufpage.c, PageOverwriteItemData(). bufpage.c also has a new
> function PageIndexDeleteNoCompact which is similar in spirit to
> PageIndexMultiDelete except that item pointers do not change. This is
> necessary because the revmap stores item pointers, and such reference
> would break if we were to renumber items in index pages.

ISTM that when the old tuple cannot be updated in-place, the new index
tuple is inserted with mm_doinsert(), but the old tuple is never deleted.

- Heikki

--
- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2014-06-23 15:19:44 Re: replication commands and log_statements
Previous Message Andres Freund 2014-06-23 15:16:15 Re: Atomics hardware support table & supported architectures