Re: Block B-Tree concept

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Block B-Tree concept
Date: 2006-09-29 13:54:58
Message-ID: 451D25B2.9020208@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Simon Riggs wrote:
> On Fri, 2006-09-29 at 10:51 +0100, Heikki Linnakangas wrote:
>> Heikki Linnakangas wrote:
>>> If we want to keep the property that VACUUM doesn't re-evaluate index
>>> entries, any proposal that doesn't keep track of every heap tuple
>>> isn't going to work. I'll try to modify the design I had in mind so
>>> that it does keep track of every heap tuple in some form.
>
> The ideal situation is that we have one index pointer per block, so we
> should look for that and optimize accordingly. We mark the heap block to
> indicate how many block index pointers there are to it. If we have only
> a single pointer, then VACUUM will only have to touch the index pointer
> when the whole heap block is removed. In that case we have no
> re-evaluation of the index AFAICS.

I don't see how that would work. It sounds similar to the reference
counting option I proposed, which had the same re-evaluation problem.

And in addition, it requires adding index-specific information to the
heap page format, which troubles me from a modularization viewpoint.

>> After some thought:
>>
>> Imagine a normal B-tree just like what we have now. But when there is
>> more than one tuple on the same heap page with consecutive index keys,
>> we represent all of them in a single index tuple that contains the key
>> of the first one of them, and a (run-length encoded) bitmap of the
>> OffsetNumbers. We should get most of the space and I/O savings as with
>> the original proposal, but we can vacuum it without re-evaluating index
>> expressions.
>
> The benefit we're seeking with a block index is that most INSERTs don't
> write to the index. With that scheme we'd need to continually update the
> index tuple so that it exactly represented the heap after each inserted
> tuple, which is going to cause a hot block problem.

That's just one of the benefits. I think the main benefit is dramatic
reduction in index size which means that more of the index is cached.

An INSERT will have to find the corresponding leaf page anyway. Having
to dirty it isn't a big deal assuming that the hot blocks stay in cache.

The hot block problem worries me a bit too. Any indexing scheme that
packs more items on a block is going to suffer from that. Testing will
show if that becomes a problem.

> Can we have this scheme enabled *only* for functional block indexes?

No. As Tom pointed out, data type specific functions have potentially
the same problem.

And having both versions seems like a lot of code and complexity.

> The bitmap would allow us to access heap rows faster in some
> circumstances, I suppose.

Yes, you wouldn't have to re-evaluate index quals on every tuple, when
the whole range represented by the index tuple falls within the range
you're searching for. And when there's only few tuples with consecutive
keys on a heap page (which is not a good use case for block B-trees),
you don't need to scan the whole page to find those matches.

> Multi-block bitmaps do make this too much like bitmap indexes and the
> use-case is very different. [Is there some kind of hybrid solution of
> block & bitmap indexes?]

Not that I know of, though there is different kind of bitmap indexes.
The one that didn't make it to 8.2 uses equality encoding, where you
have a bitmap for every distinct value. You can also have
range-encoding, where you have a bitmap for ranges of values, for
example one bitmap for 1-10, another for 10-15 etc. If you choose the
ranges dynamically so that you have one range for each heap page (when
it's clustered), you get something similar to the proposed Block B-tree.

The current bitmap encoding scheme is optimized for large bitmaps,
though, so the performance wouldn't be as good.

>> It does change the format of an index tuple, unlike the original
>> proposal. That adds some complexity. but it's doable.
>
> Can we use an info bit to have two index tuple formats?
> - single tuple (as now)
> - multiple tuple block bitmap (as you suggest)

Yes. There's one bit free in the index tuple header.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Dunstan 2006-09-29 14:00:43 Re: Backup and restore through JDBC
Previous Message Martijn van Oosterhout 2006-09-29 13:51:56 Re: send()/receive() and on-disk storage