Since the discussion in September on the Block B-Tree concept, I've worked
on that idea and it has evolved into what I call "Grouped Index Tuples".
That means that one index tuple in a B-tree can represent many heap tuples
that are on the same heap page and have monotonic keys. By monotonic, I
mean that if A and B are the smallest and largest keys of the heap tuples
represented by the grouped index tuple, there is no indexed tuple with key
X, A < X <= B, that is not represented by the same grouped tuple. For
example, if a group's keys span 10 to 15, every indexed tuple with a key
above 10 and at most 15 must be a member of that group.
The format of a grouped index tuple is the same as a normal index tuple,
except that instead of storing the offset of a single heap tuple, a
bitmap of offsets is stored.
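To illustrate the idea, here's a simplified toy sketch (not the actual
on-disk format in the patch, which piggybacks the bitmap on the normal
IndexTuple format): conceptually, a grouped tuple carries a heap block
number and one bit per line pointer on that heap page.

#include <stdint.h>
#include <stdio.h>

/* Roughly MaxHeapTuplesPerPage for 8 kB pages; the exact value depends
 * on the build. */
#define MAX_HEAP_OFFSETS 291

/* Toy layout: heap block number plus one bit per line pointer on that
 * heap page, followed by a single index key as usual. */
typedef struct
{
    uint32_t heapBlock;
    uint8_t  offsetBitmap[(MAX_HEAP_OFFSETS + 7) / 8];
    /* ... index key of the smallest member would follow here ... */
} ToyGroupedTuple;

static void
setOffset(ToyGroupedTuple *tup, int offnum)     /* offnum is 1-based */
{
    tup->offsetBitmap[(offnum - 1) / 8] |= 1 << ((offnum - 1) % 8);
}

static int
offsetIsSet(const ToyGroupedTuple *tup, int offnum)
{
    return (tup->offsetBitmap[(offnum - 1) / 8] >> ((offnum - 1) % 8)) & 1;
}

int main(void)
{
    ToyGroupedTuple tup = { .heapBlock = 1 };

    for (int off = 1; off <= 5; off++)
        setOffset(&tup, off);            /* group covers offsets 1..5 */
    printf("offset 3: %d, offset 6: %d\n",
           offsetIsSet(&tup, 3), offsetIsSet(&tup, 6));
    return 0;
}

Where a plain index tuple stores a single (block, offset) pair, one tuple
plus a bitmap of a few dozen bytes can point to every matching tuple on
the heap page.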
Essentially, it's a lossy compression scheme for B-tree indexes that
takes advantage of clustering in the table. On a fully clustered table,
it can achieve impressive compression by effectively storing only one
index tuple per heap page instead of one per heap tuple. In the best case
of very narrow rows and full clustering, that can be a 100-fold
reduction. On a completely non-ordered table, it degrades to a normal
B-tree. Smaller index size is important because it means less I/O, which
means more overall throughput.
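To put rough numbers on the best case: with the default 8 kB block size,
a heap page can hold a couple of hundred very narrow rows, and a plain
B-tree stores one index tuple for each of them. If a fully clustered
table needs only about one grouped tuple per heap page, the leaf level
shrinks by roughly that factor, which is where the 100-fold figure above
comes from.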
The cost of not storing the key of every heap tuple is that when
scanning the index, all heap tuples represented by a matching grouped
index tuple need to be fetched, their index keys recalculated, and
compared against the scan quals. However, because we only group tuples
that are on the same heap page, that doesn't involve more I/O, just more
CPU usage. (Of course, if your database already fits in memory and
you're at 100% CPU usage, you're better off without this.)
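As a rough sketch of that extra CPU work (toy code with made-up data,
not the patch's actual functions): the scan walks the group's offset
bitmap, fetches each member from the single heap page, recomputes its
key, rechecks the quals, and sorts the survivors.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Toy model of one heap page: keys indexed by line pointer number
 * (slot 0 unused, 0 = empty slot). */
#define PAGE_SLOTS 8
static const int heap_page[PAGE_SLOTS + 1] = {0, 10, 12, 15, 40, 55, 0, 0, 0};

static int cmp_int(const void *a, const void *b)
{
    return *(const int *) a - *(const int *) b;
}

/* Recheck the members of a grouped tuple against the qual
 * "lo <= key <= hi"; bit n of the bitmap stands for heap offset n.
 * Returns the matching keys in key order. */
static int
recheck_group(uint32_t offsetBitmap, int lo, int hi, int *out)
{
    int n = 0;

    for (int off = 1; off <= PAGE_SLOTS; off++)
    {
        if (!(offsetBitmap & (1u << off)))
            continue;                 /* not a member of the group */
        int key = heap_page[off];     /* "fetch" and recompute key */
        if (key >= lo && key <= hi)
            out[n++] = key;
    }
    /* Members come back in heap order; a normal index scan must sort
     * them by key before handing them to the executor. */
    qsort(out, n, sizeof(int), cmp_int);
    return n;
}

int main(void)
{
    int matches[PAGE_SLOTS];
    int n = recheck_group(0x3E, 12, 41, matches);  /* offsets 1..5 */

    for (int i = 0; i < n; i++)
        printf("%d\n", matches[i]);    /* prints 12, 15, 40 */
    return 0;
}

Note that every fetch hits the same heap page, which is why the overhead
is CPU only.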
Just to be clear: this doesn't incur any extra cost (or savings) for
scans or inserts when no grouped tuples are involved. And you can have a
mixture of grouped and normal index tuples in the same index.
As a side note: can you come up with a particularly bad scenario where
this would lead to a dramatic performance loss? I can't immediately
think of one, though I'm sure there are scenarios where this would lead
to a moderate decrease in performance. That needs to be considered if we
want to enable this by default.
A patch against 8.2 beta3 that implements this is attached (gzipped).
It's fully functional but not very well tested, and there are a lot of
small refactorings and improvements that could be made. It's a very long
patch, so I'll try to summarize the changes in it (see also the
additions to nbtree/README in it):
* When a normal index scan encounters a grouped tuple, all the heap
tuples represented by it need to be fetched, the scan quals
re-evaluated, and the matches sorted by key before returning them to the
executor.
* When a bitmap scan encounters a grouped tuple, it returns a lossy
pointer to the whole heap page, and the bitmap heap scan code takes care
of scanning the heap page and re-evaluating scan quals. If the
amgetmulti API had a way to return "candidate" matches, the bitmap heap
scan code could scan just the potentially matching tuples.
* The logic of locating the starting key in a scan needs to know that if
no exact match is found, the previous item might be a grouped tuple that
contains a matching key.
* A grouped index tuple that doesn't match a non-boundary scan qual
can't be skipped, because some of the heap tuples it represents might
still match the qual.
* When inserting a new tuple, if the key of the new tuple falls within
the range of keys represented by a grouped tuple, the grouped tuple
needs to be split in two. For example, consider a heap and index like
this:
key | heap blk | heap offsets
10  | 1        | 1 2 3 4 5
60  | 2        | 1 2
An insert of key 35, going to heap page 2 at offset 3, would need to
split the first index tuple into two halves:
key | heap blk | heap offsets
10  | 1        | 1 2 3
35  | 2        | 3
40  | 1        | 4 5
60  | 2        | 1 2
The changes in nbtinsert.c are there to support this groupsplit
operation (a toy sketch of the bitmap partitioning appears after this
list). Splitting the original tuple into two halves needs to be done
atomically, in one WAL record, which is why I had to modify the
_bt_insertonpg and _bt_split functions to support that.
* The uniqueness check needs to fetch the heap tuples represented by
grouped tuples and re-calculate their index keys, just like the search
logic does.
* All new tuples are initially inserted as normal index tuples, but when
a page gets full, they're coalesced into grouped tuples to make room, in
a "groupify" operation. Originally I tried to eagerly set new bits in
existing grouped tuples, but doing it lazily as a bulk operation
simplifies the insertion logic quite a bit. Besides, we make better use
of the space on the page this way.
* Vacuum code hasn't changed much. Unlike in my first proposal, the
index still stores a reference to every indexed heap tuple, in the form
of a bitmap, so dead tuples can be vacuumed as usual. The offset bitmaps
are decoded, the callback function is called on each offset
individually, and the bits representing dead tuples are cleared.
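To make the groupsplit example above concrete, here is a toy sketch of
the bitmap partitioning involved (made-up structures and data, not the
patch's nbtinsert.c code). Members whose keys exceed the incoming key
move to a new grouped tuple, whose key becomes the smallest moved key:

#include <stdio.h>
#include <stdint.h>

/* Toy grouped tuple: (minKey, heapBlock, offsetBitmap), with bit n
 * standing for heap offset n. */
typedef struct
{
    int      minKey;
    uint32_t heapBlock;
    uint32_t offsetBitmap;
} ToyGroupedTuple;

/* Keys of the heap tuples on block 1, indexed by offset (0 unused). */
static const int heap_keys[] = {0, 10, 20, 30, 40, 50};

/* Split 'g' around 'newKey': members with key <= newKey stay in 'g',
 * the rest move to '*high', whose key becomes the smallest moved key. */
static void
groupsplit(ToyGroupedTuple *g, int newKey, ToyGroupedTuple *high)
{
    high->heapBlock = g->heapBlock;
    high->offsetBitmap = 0;
    high->minKey = 0;

    for (int off = 1; off <= 5; off++)
    {
        if (!(g->offsetBitmap & (1u << off)))
            continue;
        if (heap_keys[off] > newKey)
        {
            g->offsetBitmap &= ~(1u << off);
            high->offsetBitmap |= 1u << off;
            if (high->minKey == 0 || heap_keys[off] < high->minKey)
                high->minKey = heap_keys[off];
        }
    }
}

int main(void)
{
    ToyGroupedTuple g = {10, 1, 0x3E};   /* key 10, blk 1, offsets 1..5 */
    ToyGroupedTuple high;

    groupsplit(&g, 35, &high);           /* inserting key 35 */
    printf("low:  key %d, offsets bitmap 0x%02x\n",
           g.minKey, (unsigned) g.offsetBitmap);
    printf("high: key %d, offsets bitmap 0x%02x\n",
           high.minKey, (unsigned) high.offsetBitmap);
    /* prints: low key 10 with offsets 1..3, high key 40 with 4..5 */
    return 0;
}

The new index tuple for key 35 is then inserted between the two halves
as usual; the tricky part mentioned above is doing all of these page
modifications atomically under one WAL record.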
I've refrained from refactoring old code and modifying the indexam API
as much as possible, to make the patch less invasive and easier to
review. However, before applying, there are some things I'd like to do:
* add support for candidate matches in the amgetmulti API (one
hypothetical shape is sketched after this list). This should be thought
out together with the bitmap index code; I think supporting
range-encoded bitmap indexes is going to need that capability as well.
* add support for candidate matches and partial ordering to the normal
index scan code, thus moving the logic of scanning a heap page,
re-evaluating index keys and sorting tuples out of the B-tree code.
Again, I think this capability could also be used to support range
queries with bitmap indexes.
* clean up the insertonpg/split code. I tried to make as few changes
as possible to ease review, but it just doesn't look pretty anymore. And
the WAL records have become very complex.
* clean up the hacks used to attach bitmaps to IndexTuples. Maybe it's
time to re-introduce the BTItem struct we had at one point, but again I
didn't want to change the existing code where it wasn't absolutely
necessary.
* move the parse_* functions out of guc.c to a separate module and make
them non-static, so they could easily be used to parse the WITH (xxx)
options in the indexam and elsewhere.
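For the candidate-match idea, here is one hypothetical shape such an
amgetmulti extension could take. These are toy declarations only; the
names and layout are made up, and the existing amgetmulti just returns
plain TIDs:

#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for PostgreSQL's ItemPointerData. */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} ToyItemPointer;

/* Each returned TID would carry a flag saying whether it is an exact
 * match or only a candidate that must be rechecked against the quals
 * at the heap. */
typedef struct
{
    ToyItemPointer tid;
    bool           is_candidate;
} CandidateTid;

/* Hypothetical amgetmulti-style entry point: fill 'tids' with up to
 * 'max' entries and return the number filled. Members of grouped
 * tuples would come back with is_candidate = true. */
int hypothetical_getmulti(void *scan, CandidateTid *tids, int max);

With something like this, the bitmap heap scan could recheck just the
flagged rows instead of falling back to a lossy whole-page pointer, and
a range-encoded bitmap index could use the same flag for its inexact
matches.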
There are also some additional features/optimizations we could do, but
that I haven't bothered with yet:
* improve the cost model. I haven't done anything to that yet. I'm not
sure if and how it needs to be changed, but it's certainly something to
think about.
* add more control (or, even better, automatic tuning) of the threshold
for coalescing tuples into grouped ones. For example, the CPU cost of
all the extra index key evaluations and comparisons might outweigh the
I/O savings if you have an expression index with a very expensive
function, and it would be nice to take that into account.
* When all the heap tuples represented by a grouped index tuple have the
same key, the key on the index tuple could be checked directly against
non-boundary scan quals without fetching the heap tuples, though that's
not possible in the general case. The same applies to uniqueness checks.
* I'm not sure what we should be counting in pg_stats. The patch adds a
number of extra counters that have been useful in debugging, but what
should be counted as an idx_tup_fetch, for example?
* and of course, there's no user documentation yet ;)
I'm hoping we could hash out the above API changes and do some
refactoring early in the 8.3 cycle, and then commit this as a smaller
patch on top of that.