Re: WIP: BRIN multi-range indexes

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Mark Dilger <hornschnorter(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: WIP: BRIN multi-range indexes
Date: 2019-02-23 22:23:22
Message-ID: 28c69eb3-3b93-f9f8-1e76-cfac1eda5254@2ndquadrant.com
Lists: pgsql-hackers

Apparently cputube did not pick up the last version of the patches I
submitted in December (and I don't see the message in the thread in the
archive either), so it's listed as broken.

So here we go again, hopefully this time everything will go through ...

regards

On 12/28/18 12:45 AM, Tomas Vondra wrote:
> Hi all,
>
> Attached is an updated/rebased version of the patch series. There are no
> changes to behavior, but let me briefly summarize the current state:
>
> 0001 and 0002
> -------------
>
> The first two parts are "just" refactoring of the existing code. 0001
> passes all scankeys to the opclass at once - this is needed by the new
> minmax-like opclass, but per discussion with Alvaro it seems worthwhile
> even independently, and I tend to agree with that. Similarly for 0002,
> which moves all IS NULL checks entirely to bringetbitmap().
>
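Just to illustrate the idea of passing all the keys in one call - this is a
made-up toy interface, not the actual BRIN consistent function or the patch's
signature - the point is that the opclass sees all conditions for a range at
once and can bail out as soon as one of them rules the range out:

  #include <stdbool.h>

  /* hypothetical types, for illustration only */
  typedef struct MinMaxSummary { int min; int max; } MinMaxSummary;
  typedef struct ScanCond { char op; int arg; } ScanCond;  /* op: '<', '=', '>' */

  /* may the range contain rows matching ALL of the conditions? */
  static bool
  range_consistent(const MinMaxSummary *s, const ScanCond *keys, int nkeys)
  {
      for (int i = 0; i < nkeys; i++)
      {
          bool maybe = true;

          if (keys[i].op == '<')
              maybe = (s->min < keys[i].arg);
          else if (keys[i].op == '=')
              maybe = (s->min <= keys[i].arg && keys[i].arg <= s->max);
          else if (keys[i].op == '>')
              maybe = (s->max > keys[i].arg);

          if (!maybe)
              return false;       /* one key already excludes the range */
      }
      return true;
  }
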
> 0003 bloom opclass
> ------------------
>
> The first new opclass is based on bloom filters. For each page range
> (i.e. 1MB by default) a small bloom filter is built (with hash values of
> the original values as inputs), and then used to evaluate equality
> queries. A small optimization is that initially the actual hash values
> are kept verbatim, and the switch to a bloom filter only happens once
> they'd exceed the bloom filter size. This improves behavior on
> low-cardinality data sets.
>
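A rough sketch of that "exact phase" trick, with made-up names and sizes (the
patch's on-disk layout is different, and would reuse the same space for both
phases rather than keeping two arrays):

  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>

  #define NBITS   1024                /* bloom filter size in bits */
  #define MAXHASH (NBITS / 32)        /* 32 hashes = same 128 bytes as the bits */

  typedef struct RangeFilter
  {
      bool     is_bloom;              /* switched to bloom mode yet? */
      int      nhashes;               /* used in the exact phase only */
      uint32_t hashes[MAXHASH];       /* exact phase: raw hash values */
      uint8_t  bits[NBITS / 8];       /* bloom phase: bit array */
  } RangeFilter;

  static void
  bloom_set(RangeFilter *f, uint32_t h)
  {
      /* derive two bit positions from the hash ("double hashing" style) */
      uint32_t h1 = h % NBITS;
      uint32_t h2 = ((h >> 16) | (h << 16)) % NBITS;

      f->bits[h1 / 8] |= 1 << (h1 % 8);
      f->bits[h2 / 8] |= 1 << (h2 % 8);
  }

  static void
  filter_add(RangeFilter *f, uint32_t h)
  {
      if (!f->is_bloom && f->nhashes < MAXHASH)
      {
          f->hashes[f->nhashes++] = h;    /* still exact, no false positives */
          return;
      }

      if (!f->is_bloom)
      {
          /* overflow: fold the stored hashes into a real bloom filter */
          f->is_bloom = true;
          memset(f->bits, 0, sizeof(f->bits));
          for (int i = 0; i < f->nhashes; i++)
              bloom_set(f, f->hashes[i]);
      }

      bloom_set(f, h);
  }

As long as a range stays in the exact phase (few distinct values), lookups
have no false positives at all, which is why this helps low-cardinality data.
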
> Picking the bloom filter parameters is the tricky part - we don't have a
> reliable source of the necessary information (namely the number of
> distinct values per range), and e.g. the false positive rate really has
> to be picked by the user because it's a compromise between index size
> and accuracy. Essentially, the false positive rate is the fraction of
> the table that has to be scanned for a random value (on average), but
> lowering it also makes the index larger, because the per-range bloom
> filters have to be larger.
>
> Another reason why this needs to be defined by the user is that the
> space for an index tuple is limited to one page (8kB by default), so we
> can't allow the bloom filter to get larger than that (we have to assume
> it's incompressible, because at the optimal fill it's about 50% zeroes
> and ones). But the BRIN index may be multi-column, and the limit applies
> to the whole tuple, and we don't know what the opclasses or parameters
> of the other columns are.
>
> So the patch simply adds two reloptions:
>
> a) n_distinct_per_range - number of distinct values per range
> b) false_positive_rate - false positive rate of the filter
>
> There are some simple heuristics to ensure the values are reasonable
> (e.g. an upper limit on the number of distinct values), and perhaps we
> might consider stats from the underlying table (when not empty), but the
> patch does not do that.
>
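For reference, the textbook sizing math that ties the two reloptions together
(the patch applies its own sanity limits and rounding on top of this, so treat
it purely as an illustration):

  #include <math.h>
  #include <stdio.h>

  /* classic formulas: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes */
  static void
  bloom_size(double n_distinct_per_range, double false_positive_rate,
             int *nbits, int *nhashes)
  {
      const double ln2 = log(2.0);
      double m = -n_distinct_per_range * log(false_positive_rate) / (ln2 * ln2);
      double k = (m / n_distinct_per_range) * ln2;

      *nbits = (int) ceil(m);
      *nhashes = (int) round(k);
  }

  int
  main(void)
  {
      int nbits, nhashes;

      /* e.g. 100 distinct values per range, 1% false positive rate */
      bloom_size(100, 0.01, &nbits, &nhashes);
      printf("%d bits (~%d bytes), %d hashes\n",
             nbits, (nbits + 7) / 8, nhashes);   /* 959 bits (~120 bytes), 7 */
      return 0;
  }

So 100 distinct values per range at a 1% false positive rate costs roughly
120 bytes per range, and on average about 1% of the non-matching ranges will
still pass the filter and have to be scanned.
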
>
> 0004 multi-minmax opclass
> -------------------------
>
> The second opclass addresses a common issue with minmax indexes, where
> the table is initially nicely correlated and the index works fine, but
> then deletes/updates route data into other parts of the table, making
> the ranges very wide and rendering the BRIN index inefficient.
>
> One way to improve this would be to consider the index(es) while
> routing the new tuple, i.e. looking not only for a page with enough free
> space, but for pages in ranges that already match the value (or come
> close to it).
>
> Partitioning is another possible approach to segregate the data, but it
> certainly has much higher overhead, both in terms of maintenance and
> planning (particularly with a 1:1 mapping of ranges to partitions).
>
> So the new multi-minmax opclass takes a different approach, replacing
> the one minmax range with multiple ranges (64 boundary values or 32
> ranges by default). Initially individual values are stored, and after
> reaching the maximum number of values the values are merged into ranges
> by distance. This allows handling outliers very efficiently, because
> they will not be merged with the "main" range for as long as possible.
>
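To make the "merge by distance" part concrete, here's a trivial standalone
sketch (not the patch's code, and ignoring how the values are collected):
keeping the N-1 widest gaps as boundaries is equivalent to repeatedly merging
the closest values, so an outlier separated by a huge gap ends up isolated in
its own tiny range instead of stretching the main one:

  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>

  typedef struct Range { int lo; int hi; } Range;

  /* collapse nvalues sorted values into at most nranges [lo,hi] ranges */
  static int
  build_ranges(const int *values, int nvalues, int nranges, Range *out)
  {
      int   ngaps = nvalues - 1;
      int   nsplits = (nranges - 1 < ngaps) ? nranges - 1 : ngaps;
      bool *is_split = calloc(ngaps, sizeof(bool));
      int   n = 0, lo = 0;

      /* mark the nsplits widest gaps as boundaries between ranges */
      for (int s = 0; s < nsplits; s++)
      {
          int best = -1;

          for (int i = 0; i < ngaps; i++)
              if (!is_split[i] &&
                  (best < 0 ||
                   values[i + 1] - values[i] > values[best + 1] - values[best]))
                  best = i;

          is_split[best] = true;
      }

      /* walk the values and emit one range per unsplit run */
      for (int i = 0; i < ngaps; i++)
          if (is_split[i])
          {
              out[n++] = (Range){ values[lo], values[i] };
              lo = i + 1;
          }
      out[n++] = (Range){ values[lo], values[nvalues - 1] };

      free(is_split);
      return n;
  }

  int
  main(void)
  {
      int   values[] = { 1, 2, 3, 4, 5, 1000000 };    /* one outlier */
      Range out[2];
      int   n = build_ranges(values, 6, 2, out);

      for (int i = 0; i < n; i++)
          printf("[%d, %d]\n", out[i].lo, out[i].hi);
      /* prints [1, 5] and [1000000, 1000000] - the outlier stays isolated */
      return 0;
  }
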
> Similarly to the bloom opclass, the main challenge here is deciding the
> parameter - in this case, it's the number of values per range. Again,
> it's a compromise between index size and efficiency. The default (64 values) is
> fairly reasonable, but ultimately it's up to the user - there is a new
> reloption "values_per_range".
>
>
>
> regards
>

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-Pass-all-keys-to-BRIN-consistent-function-a-20190223.patch.gz application/gzip 5.4 KB
0002-Move-IS-NOT-NULL-checks-to-bringetbitmap-20190223.patch.gz application/gzip 3.0 KB
0003-BRIN-bloom-indexes-20190223.patch.gz application/gzip 20.6 KB
0004-BRIN-multi-range-minmax-indexes-20190223.patch.gz application/gzip 29.6 KB
