Re: [RFC] Minmax indexes

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [RFC] Minmax indexes
Date: 2013-06-17 21:42:03
Message-ID: 51BF82AB.1020208@fuzzy.cz
Lists: pgsql-hackers

Hi!

This sounds really interesting, so a few quick comments.

On 15.6.2013 00:28, Alvaro Herrera wrote:

> In each index tuple (corresponding to one page range), we store: -
> first block this tuple applies to - last block this tuple applies to
> - for each indexed column: * min() value across all tuples in the
> range * max() value across all tuples in the range * nulls present in
> any tuple?

What about adding a counter of how many times the min/max value occurs
in the range? Other messages in this thread suggest that the refresh
after UPDATE/DELETE is one of the concerns - as Greg Stark mentioned,
the range needs to be invalidated only when a DELETE removes a row
matching the min/max value. I believe having a counter would be an
improvement - a refresh would be needed only once the counter reaches 0.
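To illustrate what I mean (just a Python sketch with made-up structures,
not the actual AM code): the summary tuple would carry occurrence
counters for the boundary values, and a DELETE would invalidate the
range only when the last row holding the min (or max) goes away.

```python
# Illustrative sketch of the per-range counter idea; all names are
# hypothetical, the real thing would live in the index tuple.

class RangeSummary:
    def __init__(self, values):
        self.min = min(values)
        self.max = max(values)
        # how many rows in the range carry the boundary values
        self.min_count = sum(1 for v in values if v == self.min)
        self.max_count = sum(1 for v in values if v == self.max)
        self.valid = True

    def on_delete(self, value):
        """Called when a row with `value` is deleted from the range."""
        if value == self.min:
            self.min_count -= 1
            if self.min_count == 0:
                self.valid = False   # needs re-summarization
        if value == self.max:
            self.max_count -= 1
            if self.max_count == 0:
                self.valid = False

s = RangeSummary([3, 1, 4, 1, 5])
s.on_delete(4)          # not a boundary value: summary stays valid
assert s.valid
s.on_delete(1)          # one of two rows holding min=1: still valid
assert s.valid and s.min_count == 1
s.on_delete(1)          # last row with the min: range must be refreshed
assert not s.valid
```

So a DELETE hitting a non-boundary value costs nothing, and even a
boundary DELETE is free as long as duplicates remain in the range.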

> Summarization -------------
>
> At index creation time, the whole table is scanned; for each page
> range the min() and max() values and nulls bitmap are collected, and
> stored in the index. The possibly-incomplete range at the end of the
> table is also included.

Would it make sense to build this from an existing (b-tree) index for
very large tables? Clearly it doesn't make sense to keep both a b-tree
and a minmax index on the same column, but for very large databases
upgraded using pg_upgrade this might be interesting.
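Roughly what I have in mind (a hypothetical Python sketch; the names and
the fixed range size are made up): the b-tree already contains (key,
heap block) pairs, so the per-range summaries could be assembled from an
index scan without reading the heap pages at all.

```python
# Illustrative only: build per-range min/max summaries from an existing
# b-tree's (key, block) pairs, assuming each page range covers
# RANGE_SIZE consecutive heap blocks.

RANGE_SIZE = 4

def summarize_from_btree(entries):
    """entries: iterable of (key, heap_block) from an index scan."""
    summaries = {}  # range id -> [min, max]
    for key, block in entries:
        rid = block // RANGE_SIZE
        s = summaries.setdefault(rid, [key, key])
        s[0] = min(s[0], key)
        s[1] = max(s[1], key)
    return summaries

# blocks 0-3 form range 0, blocks 4-7 form range 1, ...
print(summarize_from_btree([(10, 0), (3, 1), (99, 5), (7, 2)]))
# -> {0: [3, 10], 1: [99, 99]}
```

Whether this beats a plain heap scan probably depends on the index being
mostly cached, but it avoids touching the (much larger) heap.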

> Once in a while, it is necessary to summarize a bunch of unsummarized
> pages (because the table has grown since the index was created), or
> re-summarize a range that has been marked invalid. This is simple:
> scan the page range calculating the min() and max() for each indexed
> column, then insert the new index entry at the end of the index. The
> main interesting questions are:
>
> a) when to do it The perfect time to do it is as soon as a complete
> page range of the configured range size has been filled (assuming
> page ranges are constant size).
>
> b) who does it (what process) It doesn't seem a good idea to have a
> client-connected process do it; it would incur unwanted latency.
> Three other options are (i) to spawn a specialized process to do it,
> which perhaps can be signalled by a client-connected process that
> executes a scan and notices the need to run summarization; or (ii) to
> let autovacuum do it, as a separate new maintenance task. This seems
> simple enough to bolt on top of already existing autovacuum
> infrastructure. The timing constraints of autovacuum might be
> undesirable, though. (iii) wait for user command.
>
>
> The easiest way to go around this seems to have vacuum do it. That
> way we can simply do re-summarization on the amvacuumcleanup routine.
> Other answers would mean we need a separate AM routine, which appears
> unwarranted at this stage.

I don't think this should be added to the autovacuum daemon. It's quite
tricky to tune autovacuum to be just aggressive enough, i.e. not running
too frequently while not lagging behind. I'm afraid this would add a
task consuming unpredictable amounts of time, which would make
autovacuum tuning even trickier.

I can live with non-summarized minmax indexes, but I can't live with
non-vacuumed tables.

> Open questions --------------
>
> * Same-size page ranges? Current related literature seems to consider
> that each "index entry" in a minmax index must cover the same number
> of pages. There doesn't seem to be a hard reason for this to be so;
> it might make sense to allow the index to self-tune so that some
> index entries cover smaller page ranges, if this allows the
> min()/max() values to be more compact. This would incur larger space
> overhead for the index itself, but might allow better pruning of page
> ranges during scan. In the limit of one index tuple per page, the
> index itself would occupy too much space, even though we would be
> able to skip reading the most heap pages, because the min()/max()
> ranges are tight; in the opposite limit of a single tuple that
> summarizes the whole table, we wouldn't be able to prune anything
> from the seqscan even though the index is very small.

I see no particular reason not to allow that, and having variable range
size would be a very nice feature IMHO.

Do you have an idea of how the self-tuning might work?

I was thinking about something like this:

(1) estimate the initial range size, trying to keep good "selectivity"
while not having very small ranges, using

(a) average row length
(b) number of distinct values in the column
(c) MCV/histogram/correlation

(2) once the index is built, run a "merge" pass on the ranges - merging
adjacent ranges if the resulting selectivity is not significantly
worse (again, this might be evaluated using the MCV list / histogram)

But it should be possible to override this (initial range size, max
range size) or disable the heuristics completely, as having large
ranges probably makes re-summarizing more expensive. Although the
counter proposed above might fix that, as with a larger range it's more
likely that another row carries the min/max value.
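The merge step (2) could look something like this (a greedy Python
sketch; the threshold, names and the growth criterion are all made up):
merge two adjacent ranges only if the combined [min, max] interval is
not much wider than the wider of the two parts, i.e. little selectivity
is lost.

```python
# Illustrative sketch of greedily merging adjacent page ranges.

def width(r):
    return r[1] - r[0]

def merge_ranges(ranges, max_growth=2.0):
    """ranges: list of (min, max) per page range, in physical order."""
    merged = [ranges[0]]
    for lo, hi in ranges[1:]:
        mlo, mhi = merged[-1]
        combined = (min(mlo, lo), max(mhi, hi))
        # merge only if the combined interval doesn't grow too much
        # relative to the wider of the two inputs
        if width(combined) <= max_growth * max(width(merged[-1]),
                                               width((lo, hi))):
            merged[-1] = combined
        else:
            merged.append((lo, hi))
    return merged

print(merge_ranges([(1, 10), (8, 20), (100, 110)]))
# -> [(1, 20), (100, 110)]   overlapping neighbors collapse,
#                            the distant range stays separate
```

In the real thing the criterion would of course consult the statistics
(MCV / histogram) rather than raw interval widths.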

BTW, could this be used to pre-sort the table, so that the actual sort
algorithm would start with a reasonably sorted set of tuples? I.e.
sorting the index entries by their (min,max) vector and then returning
the table pages in that order? I think it might even be possible to
efficiently split the table into multiple parts T(0), T(1), T(2), ...,
T(n) where all rows in T(i) are smaller than those in T(i+1).
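In sketch form (Python, with made-up structures), returning the heap
pages in summary order rather than physical order:

```python
# Illustrative: emit heap page ranges ordered by their (min, max)
# summary, so a downstream sort sees nearly-ordered input.

ranges = [
    {"pages": (0, 9),   "min": 500, "max": 700},
    {"pages": (10, 19), "min": 100, "max": 250},
    {"pages": (20, 29), "min": 260, "max": 480},
]

for r in sorted(ranges, key=lambda r: (r["min"], r["max"])):
    print(r["pages"])
# -> (10, 19)
#    (20, 29)
#    (0, 9)

# If additionally max of T(i) < min of T(i+1) holds for consecutive
# ranges (as above), the table splits into parts T(0) .. T(n) that
# could be sorted or aggregated independently.
```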

This in turn would make it possible to "partition" the table for
aggregates - either running them in parallel, or in sequence so as not
to run into OOM with HashAggregate. Yes, I know supporting this is far
in the future; I'm just "brainstorming" here ;-)

regards
Tomas
