Re: Minmax indexes

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Minmax indexes
Date: 2013-11-08 20:36:02
Message-ID: 20131108203602.GW5809@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Robert Haas escribió:
> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
> <alvherre(at)2ndquadrant(dot)com> wrote:
> > Here's an updated version of this patch, with fixes to all the bugs
> > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
> > Amit Kapila for the reports.
>
> I'm not very happy with the use of a separate relation fork for
> storing this data.

I have been playing with having the revmap in the main fork of the index
rather than a separate one. On the surface many things stay just what
they are; I only had to add a layer beneath the revmap that maps its
logical block numbers to physical block numbers. The problem with this
is that it needs more disk access, because revmap block numbers cannot be
hardcoded.

After doing some quick math, what I ended up with was to keep an array
of BlockNumbers in the metapage. Each element in this array points to
array pages; each array page is, in turn, filled with more BlockNumbers,
which this time correspond to the logical revmap pages we used to have
in the revmap fork. (I initially feared that this design would not
allow me to address enough revmap pages for the largest of tables; but
fortunately this is sufficient unless you configure very small pages,
say BLCKSZ 2kB, use small page ranges, and use small datatypes, say
"char". I have no problem with saying that that scenario is not
supported if you want to have minmax indexes on 32 TB tables. I mean,
who uses BLCKSZ smaller than 8kB anyway?).

The advantage of this design is that in order to find any particular
logical revmap page, you always have to do a constant number of page
accesses. You read the metapage, then read the array page, then read
the revmap page; done. Another idea I considered was chaining revmap
pages (so each would have a pointer-to-next), or chaining array pages;
but this would have meant that to locate an individual page to the end
of the revmap, you might need to do many accesses. Not good.

As an optimization for relatively small indexes, we hardcode the page
number for the first revmap page: it's always the page right after the
metapage (so BlockNumber 1). A revmap page can store, with the default
page size, about 1350 item pointers; so with an index built for page
ranges of 1000 pages per range, you can point to enough index entries
for a ~10 GB table without having the need to examine the first array
page. This seems pretty acceptable; people with larger tables can
likely spare one extra page accessed every now and then.
(For comparison, each regular minmax page can store about 500 index
tuples, if it's built for a single 4-byte column; this means that the 10
GB table requires a 5-page index.)

This is not complete yet; although I have a proof-of-concept working, I
still need to write XLog support code and update the pageinspect code to
match.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2013-11-08 21:04:38 Re: Minmax indexes
Previous Message Heikki Linnakangas 2013-11-08 20:33:24 Re: Gin page deletion bug