Re: [RFC] Minmax indexes

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] Minmax indexes
Date: 2013-07-19 05:39:58
Message-ID: 20130719053957.GD4139@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Alvaro Herrera wrote:

> This is a preliminary proposal for Minmax indexes. I'm experimenting
> with the code, but it's too crude to post yet, so here's a document
> explaining what they are and how they work, so that reviewers can poke
> holes to have the design improved.

After going further with the implementation, I have added a new concept,
the reverse range map.

Reverse Range Map
-----------------

The reverse range map is a separate fork each Minmax index has. Its purpose
is to let a way to quickly find the index tuple corresponding to a page range;
for a given heap page number, there's an O(1) way to obtain the TID of the
corresponding index tuple.

To scan the index, we first obtain the TID of index tuple for page 0. If
this returns a valid TID, we read that tuple to determine the min/max bounds
for this page range. If it returns invalid, then the range is unsummarized,
and the scan must return the whole range as needing scan. After this
index entry has been processed, we obtain the TID of the index tuple for
page 0+pagesPerRange (currently this is a compile-time constant, but
there's no reason this cannot be a per-index property). Continue adding
pagesPerRange until we reach the end of the heap.

To read the TID during an index scan, we follow this protocol:

* read revmap page
* obtain share lock on the buffer
* read the TID
* LockTuple that TID (using the index as relation). A shared lock is
sufficient. We need the LockTuple to prevent VACUUM from recycling
the index tuple; see below.
* release buffer lock
* read the index tuple
* release the tuple lock

On heap insert/update, it is normally cheaper to update the summary tuple
(grab the old tuple, expand the min/max range according to the new value
being inserted, insert the new index tuple, update the reverse range map)
than letting the range be unsummarized, which would require scanning the
pages in the range.

[However, when many updates on a range are going to be done, it'd be
better to mark it as unsummarized (i.e. set the revmap TID to Invalid)
and do a resummarization later, to prevent the index from bloating.
Also, it'd be sensible to allow postponing of the index update, if many
tuples are inserted; something like GIN's "pending list". We would need
to keep track the TIDs of the inserted heap tuples, so that we can
figure out the new min/max values, without having to scan the whole
range.]

To update the summary tuple for a page range, we use this protocol:

* insert a new index tuple anywhere; note its TID
* read revmap page
* obtain exclusive lock on buffer
* write the TID
* release lock

This ensures no concurrent reader can obtain a partially-written TID.
Note we don't need a tuple lock here. Concurrent scans don't have to
worry about whether they got the old or new index tuple: if they get the
old one, the tighter values are okay from a correctness standpoint because
due to MVCC they can't possibly see the just-inserted heap tuples anyway.

For vacuuming, we need to figure out which index tuples are no longer
referenced from the reverse range map. This requires some brute force,
but is simple:

1) scan the complete index, store each existing TIDs in a dynahash.
Hash key is the TID, hash value is a boolean initially set to false.
2) scan the complete revmap sequentially, read the TIDs on each page. Share
lock on each page is sufficient. For each TID so obtained, grab the
element from the hash and update the boolean to true.
3) Scan the index again; for each tuple found, search the hash table.
If the tuple is not present, it must have been added after our initial
scan; ignore it. If the hash flag is true, then the tuple is referenced;
ignore it. If the hash flag is false, then the index tuple is no longer
referenced by the revmap; but they could be about to be accessed by a
concurrent scan. Do ConditionalLockTuple. If this fails, ignore the
tuple, it will be deleted by a future vacuum. If lock is acquired,
then we can safely remove the index tuple.
4) Index pages with free space can be detected by this second scan. Register
those with the FSM.

Note this doesn't require scanning the heap at all, or being involved in
the heap's cleanup procedure. (In particular, tables with only minmax
indexes do not require the removed tuples' TIDs to be collected during
the heap cleanup pass.) XXX I think this is wrong in detail; we
probably need to be using LockBufferForCleanup.

With the reverse range map it is very easy to see which page ranges in
the heap need resummarization; it's the ones marked with InvalidTid.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2013-07-19 05:49:53 Re: [PATCH] pgbench --throttle (submission 7 - with lag measurement)
Previous Message Tom Lane 2013-07-19 05:27:41 Re: [HACKERS] getting rid of SnapshotNow