|From:||Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>|
|To:||Pg Hackers <pgsql-hackers(at)postgresql(dot)org>|
|Subject:||[RFC] Minmax indexes|
|Views:||Raw Message | Whole Thread | Download mbox|
This is a preliminary proposal for Minmax indexes. I'm experimenting
with the code, but it's too crude to post yet, so here's a document
explaining what they are and how they work, so that reviewers can poke
holes to have the design improved. My intention is to have a patch to
show for CF2, so please do have a look at this and comment.
This is part of the AXLE project http://www.axleproject.eu and the
intention is to support tables of very large size. In a quick
experiment, I have a table of ~12 GB and its corresponding index is 65
kB in size, making the time to do the equivalent of a seqscan a small
fraction of that taken by a real seqscan. This technique sits between a
bitmap scan of a normal btree, and a seqscan: the minmax index tells the
bitmap heap scan what pages to seqscan, allowing it to skip a large
fraction of pages that are known not to contain tuples matching the
query quals. This is a huge win for large data warehouses.
Without further ado, here's what I propose.
Minmax Range Indexes
Minmax indexes are a new access method intended to enable very fast scanning of
extremely large tables.
The essential idea of a minmax index is to keep track of the min() and max()
values in consecutive groups of heap pages (page ranges). These values can be
used by constraint exclusion to avoid scanning such pages, depending on query
The main drawback of this is having to update the stored min/max values of each
page range as tuples are inserted into them.
Other database systems already have this feature. Some examples:
* Oracle Exadata calls this "storage indexes"
* Netezza has "zone maps"
* Infobright has this automatically within their "data packs"
* MonetDB seems to have it
"Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"
To create a minmax index, we use
CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);
Partial indexes are not supported; since an index is concerned with minimum and
maximum values of the involved columns across all the pages in the table, it
doesn't make sense to exclude values. Another way to see "partial" indexes
here would be those that only considered some pages in the table instead of all
of them; but this would be difficult to implement and manage and, most likely,
Expressional indexes can probably be supported in the future, but we disallow
them initially for conceptual simplicity.
Having multiple minmax indexes in the same table is acceptable, though most of
the time it would make more sense to have a single index covering all the
interesting columns. Multiple indexes might be useful for columns added later.
Access Method Design
Since item pointers are not stored inside indexes of this type, it is not
possible to support the amgettuple interface. Instead, we only provide
amgetbitmap support; scanning a relation using this index requires a recheck
node on top. The amgetbitmap routine would return a TIDBitmap comprising all
the pages in those page groups that comply with the query quals; the recheck node
prunes tuples that are not visible per snapshot and those that are not visible
per query quals.
For each supported datatype, we need an opclass with the following catalog
- support functions (pg_amproc)
* pg_proc entry for min()
* pg_proc entry for max()
- support operators (pg_amop): same as btree (<, <=, =, >=, >)
The min() and max() support functions are used during index construction.
The support operators are used in the optimizer, so that the index is chosen
when queries on the indexed table are planned. (Also, we use them in the
amgetbitmap routine, to compare ScanKeys and decide whether to emit a certain
block or not).
In each index tuple (corresponding to one page range), we store:
- first block this tuple applies to
- last block this tuple applies to
- for each indexed column:
* min() value across all tuples in the range
* max() value across all tuples in the range
* nulls present in any tuple?
With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length
types (timestamptz, bigint), each tuple would be 524 bytes in length, which
seems reasonable. Of course, larger columns are possible, such as varchar, but
creating minmax indexes on such columns seems of little practical usefulness.
This maximum index tuple size is calculated as:
BlockNumber (4 bytes) * 2 + data value (8 bytes) * 32 * 2 + null bitmap (4 bytes)
Block ranges mentioned in index entries shouldn't overlap. However, there can
be gaps where some pages have no covering index entry. (In particular, the last
few pages of the table would commonly not be summarized.)
In order to scan a table, a sequential scan of the index is done; for each
tuple for which the range limits (min/max) are proven to be required to be
scanned by the ScanKeys, we obtain the block range covered by that tuple,
and return a lossy TIDBitmap of pages in the page range; any unsummarized pages
and invalid page ranges (see below) must also be returned.
Note that we store pg_proc entries for aggregate functions. It is the
responsibility of the access method to obtain the pg_aggregate rows starting
from the pg_proc rows; we avoid the parser code that deals with aggregates,
because the mechanism it uses to resolve from parser to executor starts from
aggregate names; this is not useful for our purposes.
Also, running the aggregate is not done through the executor support for
aggregates either, because that code is very specialized to running inside a
node with a child node feeding tuples, which IndexBuildHeapScan cannot do.
Instead, the AM is in charge of initializing state and running the transition
functions for each new tuple read by the index-build heap scan.
Data changes in the heap
Value changes in columns that are part of a minmax index, and tuple insertion
in summarized pages, would invalidate the stored min/max values. To support
this, each minmax index has a validity map; a range can only be considered in a
scan if it hasn't been invalidated by such changes (A range "not considered" in
the scan needs to be returned in whole regardless of the stored min/max values,
that is, it cannot be pruned per query quals). The validity map is very
similar to the visibility map in terms of performance characteristics: quick
enough that it's not contentious, allowing updates and insertions to proceed
even when data values violate the minmax index conditions. An invalidated
range can be made valid by re-summarization (see below).
Re-summarization is relatively expensive, because the complete page range has
to be scanned. To avoid this, a table having a minmax index would be
configured so that inserts only go to the page(s) at the end of the table; this
avoids frequent invalidation of ranges in the middle of the table. We provide
a table reloption that tweaks the FSM behavior, so that summarized pages are
not candidates for insertion.
A possible optimization is that whenever an update does not modify the values
covered by minmax indexes, or whenever the value changes are not outside the
min()/max() interval for the corresponding page range, the page range does not
need to be marked invalid. This is particularly useful when considering
same-page updates (i.e. updates which find enough empty space in the same page
that the new tuple fits in it), and HOT (which frees space used by previous
updates). This suggests we ought to recommend lower-than-100 fillfactor values
to permit these features to kick in.
When tuples are added to unsummarized pages, nothing needs to happen to the
Tuples can be removed from anywhere without restriction. See below for more on
At index creation time, the whole table is scanned; for each page range the
min() and max() values and nulls bitmap are collected, and stored in the index.
The possibly-incomplete range at the end of the table is also included.
Once in a while, it is necessary to summarize a bunch of unsummarized pages
(because the table has grown since the index was created), or re-summarize a
range that has been marked invalid. This is simple: scan the page range
calculating the min() and max() for each indexed column, then insert the new
index entry at the end of the index. The main interesting questions are:
a) when to do it
The perfect time to do it is as soon as a complete page range of the
configured range size has been filled (assuming page ranges are constant
b) who does it (what process)
It doesn't seem a good idea to have a client-connected process do it;
it would incur unwanted latency. Three other options are (i) to spawn a
specialized process to do it, which perhaps can be signalled by a
client-connected process that executes a scan and notices the need to run
summarization; or (ii) to let autovacuum do it, as a separate new
maintenance task. This seems simple enough to bolt on top of already
existing autovacuum infrastructure. The timing constraints of autovacuum
might be undesirable, though. (iii) wait for user command.
The easiest way to go around this seems to have vacuum do it. That way we can
simply do re-summarization on the amvacuumcleanup routine. Other answers would
mean we need a separate AM routine, which appears unwarranted at this stage.
Vacuuming a table that has a minmax index does not represent a significant
challenge. Since no CTIDs are stored, it's not necessary to scan the index
when heap tuples are removed. It might be that some min() value can be
incremented, or some max() value can be decremented; but this would represent
an optimization opportunity only, not a correctness issue. Perhaps it's
simpler to represent this as the need to re-run summarization on the affected
Note that if there are no indexes on the table other than the minmax index,
usage of maintenance_work_mem by vacuum can be decreased significantly, because
no detailed index scan needs to take place (and thus it's not necessary for
vacuum to save TIDs to remove). This optimization opportunity is best left for
In order to make this all work, the only thing we need to do is ensure we have a
good enough opclass and amcostestimate. With this, the optimizer is able to pick
up the index on its own.
* Same-size page ranges?
Current related literature seems to consider that each "index entry" in a
minmax index must cover the same number of pages. There doesn't seem to be a
hard reason for this to be so; it might make sense to allow the index to
self-tune so that some index entries cover smaller page ranges, if this allows
the min()/max() values to be more compact. This would incur larger
space overhead for the index itself, but might allow better pruning of
page ranges during scan. In the limit of one index tuple per page,
the index itself would occupy too much space, even though we would be
able to skip reading the most heap pages, because the min()/max()
ranges are tight; in the opposite limit of a single tuple that
summarizes the whole table, we wouldn't be able to prune anything from
the seqscan even though the index is very small.
* More compact representation for TIDBitmap?
TIDBitmap is the structure currently used to represent bitmap scans. The
representation of lossy page ranges is not optimal for our purposes, because
it uses a Bitmapset to represent pages in the range; since we're going to return
all pages in a large range, it might be more convenient to allow for a
struct that instead uses start and end page numbers to represent the range.
This is mentioned in passing in tidbitmap.c comments, so perhaps the
time has come for its implementation. (In experimentation, tidbitmap
has no trouble with ranges of a thousand pages.)
Email thread on pgsql-hackers
From: Simon Riggs
Subject: Dynamic Partitioning using Segment Visibility Map
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
|Next Message||Robert Haas||2013-06-14 22:30:22||Re: single-user vs standalone in docs and messages|
|Previous Message||Robert Haas||2013-06-14 21:00:13||dynamic background workers|