Skip site navigation (1) Skip section navigation (2)

Grouped index items (for discussion for 8.3)

From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: <pgsql-patches(at)postgresql(dot)org>
Subject: Grouped index items (for discussion for 8.3)
Date: 2006-11-10 12:18:45
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-patches
Since the discussion in September on Block B-Tree concept, I've worked 
on that and it has evolved to what I call "Grouped Index Tuples". That 
means that one index tuple in a B-tree can represent many heap tuples 
that are on the same page, and have monotonic keys. By monotonic, I mean 
that if A and B are the smallest and largest keys of the heap tuples 
represented by the grouped index tuple, there is no indexed tuple with 
key A < X <= B.

The format of a grouped index tuple is the same as a normal index tuple, 
except that instead of storing the offset of a single heap tuple, a 
bitmap of offsets is stored.

Essentially, it's a lossy compression scheme for B-tree indexes that 
takes advantage of clustering in the table. On a fully clustered table, 
it can achieve impressive compression by effectively storing only one 
index tuple per heap page instead of one per heap tuple. That can be a 
100-fold reduction if in the best case of very narrow rows and full 
clustering. On a completely non-ordered table, it degrades to a normal 
B-tree. Smaller index size is important because it means less I/O which 
means more overall throughput.

The cost of not storing the key of every heap tuple is that when 
scanning the index, all heap tuples represented by a matching grouped 
index tuple need to be fetched, index keys recalculated, and compared 
against the scan quals. However, because we only group tuples that are 
on the same heap page, that doesn't involve more I/O, just more CPU 
usage. (of course, if you're database already fits in memory and you're 
at 100% CPU usage, you're better off without this).

Just to be clear: This doesn't incur any extra cost (nor savings) to 
scans or inserts when no grouped tuples are involved. And you can have a 
mixture of grouped and normal index tuples in the same index.

As a side note: Can you come up with particularly bad scenario where 
this would lead to a dramatic performance loss? I can't immediately 
think of one, though I'm sure there's scenarios where this would lead to 
a moderate decrease in performance. It needs to be considered if we want 
to enable this by default.

A patch against 8.2 beta3 that implements this is attached (gzipped). 
It's fully functional, but not very well tested, and there's a lot of 
small refactorings and improvements that could be made. It's a very long 
patch, so I'll try to summarize the changes in it (see also the 
additions to nbtree/README in it):

Search logic

* When a normal index scan encounters a grouped tuple, all the heap 
tuples represented by it need to be fetched, scan quals re-evaluated, 
and sorted before returning them to the executor.

* When a bitmap scan encounters a grouped tuple, it returns a lossy 
pointer to the whole heap page, and the bitmap heap scan code takes care 
of scanning the heap page and re-evaluating scan quals. If the 
amgetmulti API had a way to return "candidate" matches, the bitmap heap 
scan code could scan just the potentially matching tuples.

* The logic of locating the starting key in a scan needs to know that if 
no exact match is found, the previous item might be a grouped tuple that 
contains a matching key.

* A grouped index tuple that doesn't match a non-boundary scan qual 
can't be skipped, because some of the heap tuples it represents might 
still match.

Insertion logic

* When inserting a new tuple, if the key of the new tuple falls within 
the range of tuples represented by a grouped tuple, the grouped tuple 
needs to be split to two tuples. For example, consider a heap and index 
like this:


key | heap blk | heap offsets
10  | 1        | 1 2 3 4 5
60  | 2        | 1 2

An insert of key 35 to heap page 2 would need to split the first tuple 
into two halves:

key | heap blk | heap offsets
10  | 1        | 1 2 3
35  | 2        | 3
40  | 1        | 4 5
60  | 2        | 1 2

The changes in nbtinsert.c are to support this groupsplit operation. 
Splitting the original tuple to two halves needs to be done atomically 
in one WAL record, which is why I had to modify the _bt_insertonpg and 
_bt_split functions to support that.

* Uniqueness check needs to fetch heap tuples represented by grouped 
tuples and re-calculate index keys just like the search logic.

* All new tuples are initially inserted as normal index tuples, but when 
a page gets full, they're coalesced to grouped tuples to make room in a 
"groupify" operation. Originally I tried to eagerly set new bits in 
grouped tuples, but doing it lazily as a bulk operation simplifies the 
insertion logic quite a bit. Besides, we're making a better use of the 
space on a page this way.


* Vacuum code hasn't changed much. Unlike in my first proposal, the 
index still stores a reference to all indexed heap tuples in the form of 
a bitmap, so they can be vacuumed as usual. The offset bitmaps are to be 
decoded, the callback function is called on each bit or offset 
individually, and the bits representing dead tuples are cleared.

I've refrained from refactoring old code and modifying the indexam API 
as much as possible, to make the patch less invasive and easier to 
review. However, before applying, there is some things I'd like to do:

* add support for candidate matches in the amgetmulti API. This should 
be thought together with the bitmap index code; I think supporting range 
encoded bitmap indexes is going to need that capability as well.

* add support for candidate matches and partial ordering to the normal 
index scan code, thus moving the logic of scanning a heap page, 
re-evaluating index keys and sorting tuples out of B-tree code. Again I 
think this capability could also be used to support range queries with 
bitmap indexes.

* clean up the insertonpg/split code. I tried to make as little changes 
as possible to ease review, but it just doesn't look pretty anymore. And 
the WAL records have become very complex,

* clean up the hacks to attach bitmaps to IndexTuples. Maybe it's time 
to re-introduce the BTItem struct we had at one point, but again I 
didn't want to change the existing code where it wasn't absolutely 

* move the parse_* functions out of guc.c to a separate module and make 
them non-static, so they could easily be used to parse the WITH (xxx) 
options in the indexam and elsewhere.

There's also some additional features/optimizations we could do, but I 
haven't bothered with yet:

* improve the cost model. I haven't done anything to that. I'm not sure 
if and how it needs to be changed, but it's certainly something to think 

* add more control (or even better, automatic tuning) of the threshold 
of coalescing tuples to grouped ones. For example, the CPU cost of all 
the extra index key evaluations and comparisons might outweight the I/O 
savings if you have an expression index with a very expensive function, 
and it would be nice to take that into account.

* When all the heap tuples represented by a grouped index tuple have the 
same key, the key on the index tuple could be evaluated against 
non-boundary scan quals, though that's not the case in general. The same 
applies to uniqueness checks.

* I'm not sure what we should be counting in pg_stats. The patch adds a 
number of extra counters that have been useful in debugging, but what 
should be counted as a idx_tup_fetch for example?

* and of course, there's no user documentation yet ;)

I'm hoping we could hash out the above API changes and do some 
refactoring early in the 8.3 cycle, and commit this as a smaller patch 
after that.

   Heikki Linnakangas

Attachment: groupeditems-18-beta3.diff.gz
Description: application/x-gzip (64.1 KB)

pgsql-patches by date

Next:From: Tom LaneDate: 2006-11-10 22:32:52
Subject: Re: [HACKERS] Bug in WAL backup documentation
Previous:From: Brendan JurdDate: 2006-11-09 19:46:57
Subject: Re: [GENERAL] ISO week dates

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group