Re: WIP: BRIN multi-range indexes

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: BRIN multi-range indexes
Date: 2020-09-17 16:34:26
Message-ID: 20200917163426.zzyi62zshwemkfhx@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 17, 2020 at 10:33:06AM -0400, John Naylor wrote:
>On Sun, Sep 13, 2020 at 12:40 PM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> <20200913 patch set>
>Hi Tomas,
>The cfbot fails to apply this, but with 0001 from 0912 it works on my
>end, so going with that.

Hmm, it seems to fail because of some whitespace errors. Attached is an
updated version resolving that.

>One problem I have is I don't get success with the new reloptions:
>create index cd_multi on iot using brin(create_dt
>timestamptz_minmax_multi_ops) with (values_per_range = 64);
>ERROR: unrecognized parameter "values_per_range"
>create index on iot using brin(create_dt timestamptz_bloom_ops) with
>(n_distinct_per_range = 16);
>ERROR: unrecognized parameter "n_distinct_per_range"

But those are opclass parameters, so the parameters are not specified in
WITH clause, but right after the opclass name:

CREATE INDEX idx ON table USING brin (
bigint_col int8_minmax_multi_ops(values_per_range = 15)

>Aside from that, I'm going to try to understand the code, and ask
>questions. Some of the code might still change, but I don't think it's
>too early to do some comment and docs proofreading. I'll do this in
>separate emails for bloom and multi-minmax to keep it from being too
>long. Perf testing will come sometime later.


>+ greater than 0.0 and smaller than 1.0. The default values is 0.01,
>+ rows per block). The default values is <literal>-0.1</literal>, and
>+ the minimum number of distinct non-null values is <literal>128</literal>.
>I don't see 128 in the code, but I do see this, is this the intention?:

Ah, that's right - I might have lowered the default after writing the
comment. Will fix.

>+ * Bloom filters allow efficient test whether a given page range contains
>+ * a particular value. Therefore, if we summarize each page range into a
>+ * bloom filter, we can easily and cheaply test wheter it containst values
>+ * we get later.
>s/wheter it containst/whether it contains/

OK, will reword.

>+ * The index only supports equality operator, similarly to hash indexes.

Hmmm, are there really multiple equality operators?

>+ * The number of distinct elements (in a page range) depends on the data,
>+ * we can consider it fixed. This simplifies the trade-off to just false
>+ * positive rate vs. size.
>Sounds like the first sentence should start with "although".

Yeah, probably. Or maybe there should be "but" at the beginning of the
second sentence.

>+ * of distinct items to be stored in the filter. We can't quite the input
>+ * data, of course, but we may make the BRIN page ranges smaller - instead
>I think you accidentally a word.

Seems like that.

>+ * Of course, good sizing decisions depend on having the necessary data,
>+ * i.e. number of distinct values in a page range (of a given size) and
>+ * table size (to estimate cost change due to change in false positive
>+ * rate due to having larger index vs. scanning larger indexes). We may
>+ * not have that data - for example when building an index on empty table
>+ * it's not really possible. And for some data we only have estimates for
>+ * the whole table and we can only estimate per-range values (ndistinct).
>+ * The current logic, implemented in brin_bloom_get_ndistinct, attempts to
>+ * make some basic sizing decisions, based on the table ndistinct estimate.
>+ * XXX We might also fetch info about ndistinct estimate for the column,
>+ * and compute the expected number of distinct values in a range. But
>Maybe I'm missing something, but the first two comments don't match
>the last one -- I don't see where we get table ndistinct, which I take
>to mean from the stats catalog?

Ah, right. The part suggesting we're looking at the table n_distinct
estimate is obsolete - some older version of the patch attempted to do
that, but I decided to remove it at some point. We can add it in the
future, but I'll fix the comment for now.

>+ * To address these issues, the opclass stores the raw values directly, and
>+ * only switches to the actual bloom filter after reaching the same space
>+ * requirements.
>IIUC, it's after reaching a certain size (BLOOM_MAX_UNSORTED * 4), so
>"same" doesn't make sense here.

Ummm, no. BLOOM_MAX_UNSORTED has nothing to do with the switch from
sorted mode to hashing (which is storing an actual bloom filter).

BLOOM_MAX_UNSORTED only determines number of new items that may not
be sorted - we don't sort after each insertion, but only once in a while
to amortize the costs. So for example you may have 1000 sorted values
and then we allow adding 32 new ones before sorting the array again
(using a merge sort). But all of this is in the "sorted" mode.

The number of items the comment refers to is this:

/* how many uint32 hashes can we fit into the bitmap */
int maxvalues = filter->nbits / (8 * sizeof(uint32));

where nbits is the size of the bloom filter. So I think the "same" is
quite right here.

>+ /*
>+ * The 1% value is mostly arbitrary, it just looks nice.
>+ */
>+#define BLOOM_DEFAULT_FALSE_POSITIVE_RATE 0.01 /* 1% fp rate */
>I think we'd want better stated justification for this default, even
>if just precedence in other implementations. Maybe I can test some
>other values here?

Well, I don't know how to pick a better default :-( Ultimately it's a
tarde-off between larger indexes and scanning larger fraction of a table
due to lower false positive. Then there's the restriction that the whole
index tuple needs to fit into a single 8kB page.

>+ * XXX Perhaps we could save a few bytes by using different data types, but
>+ * considering the size of the bitmap, the difference is negligible.
>Yeah, I think it's obvious enough to leave out.
>+ m = ceil((ndistinct * log(false_positive_rate)) / log(1.0 /
>(pow(2.0, log(2.0)))));
>I find this pretty hard to read and pgindent is going to mess it up
>further. I would have a comment with the formula in math notation
>(note that you can dispense with the reciprocal and just use
>negation), but in code fold the last part to a constant. That might go
>against project style, though:
>m = ceil(ndistinct * log(false_positive_rate) * -2.08136);

Hmm, maybe. I've mostly just copied this out from some bloom filter
paper, but maybe it's not readable.

>+ * XXX Maybe the 64B min size is not really needed?
>Something to figure out before commit?

Probably. I think this optimization is somewhat pointless and we should
just allocate the right amount of space, and repalloc if needed.

>+ /* assume 'not updated' by default */
>+ Assert(filter);

I think they are not related, although the formatting might make it seem
like that.

>I don't see how these are related.
>+ big_h = ((uint32) DatumGetInt64(hash_uint32(value)));
>I'm curious about the Int64 part -- most callers use the bare value or
>with DatumGetUInt32().

Yeah, that formula should use DatumGetUInt32.

>Also, is there a reference for the algorithm for hash values that
>follows? I didn't see anything like it in my cursory reading of the
>topic. Might be good to include it in the comments.

This was suggested by Yura Sokolov [1] in 2017. The post refers to a
paper [2] but I don't recall which part describes "our" algorithm.


>+ * Tweak the ndistinct value based on the pagesPerRange value. First,
>Nitpick: "Tweak" to my mind means to adjust an existing value. The
>above is only true if ndistinct is negative, but we're really not
>tweaking, but using it as a scale factor. Otherwise it's not adjusted,
>only clamped.

OK. Perhaps 'adjust' would be a better term?

>+ * XXX We can only do this when the pagesPerRange value was supplied.
>+ * If it wasn't, it has to be a read-only access to the index, in which
>+ * case we don't really care. But perhaps we should fall-back to the
>+ * default pagesPerRange value?
>I don't understand this.

IIRC I thought there were situations when pagesPerRange value is not
defined, e.g. in read-only access. But I'm not sure about this, and
cosidering the code actally does not check for that (in fact, it has an
assert enforcing valid block number) I think it's a stale comment.

>+static double
>+brin_bloom_get_fp_rate(BrinDesc *bdesc, BloomOptions *opts)
>+ return BloomGetFalsePositiveRate(opts);
>The body of the function is just a macro not used anywhere else -- is
>there a reason for having the macro? Also, what's the first parameter

No reason. I think the function used to be more complicated at some
point, but it got simpler.

>Similarly, BloomGetNDistinctPerRange could just be inlined or turned
>into a function for readability.

Same story.

>+ * or null if it is not exists.
>s/is not exists/does not exist/
>+ /*
>+ * XXX not sure the detoasting is necessary (probably not, this
>+ * can only be in an index).
>+ */
>Something to find out before commit?
>+ /* TODO include the sorted/unsorted values */

This was simplemented as part of the discussion about pageinspect, and
I wanted some confirmation if the approach is acceptable or not before
spending more time on it. Also, the values are really just hashes of the
column values, so I'm not quite sure it makes sense to include that.
What would you suggest?


Tomas Vondra
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-Pass-all-scan-keys-to-BRIN-consistent-funct-20200917.patch text/plain 23.5 KB
0002-Move-IS-NOT-NULL-handling-from-BRIN-support-20200917.patch text/plain 23.3 KB
0003-optimize-allocations-20200917.patch text/plain 3.2 KB
0004-BRIN-bloom-indexes-20200917.patch text/plain 138.7 KB
0005-BRIN-minmax-multi-indexes-20200917.patch text/plain 217.3 KB
0006-Ignore-correlation-for-new-BRIN-opclasses-20200917.patch text/plain 4.1 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2020-09-17 16:50:00 Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away
Previous Message Tom Lane 2020-09-17 16:04:57 Re: BUG #15858: could not stat file - over 4GB