Re: WIP: BRIN multi-range indexes

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WIP: BRIN multi-range indexes
Date: 2021-01-23 02:59:14
Message-ID: 2e903689-c518-f77d-551a-5e3e279d0838@enterprisedb.com
Lists: pgsql-hackers


On 1/23/21 12:27 AM, John Naylor wrote:
> On Thu, Jan 21, 2021 at 9:06 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com <mailto:tomas(dot)vondra(at)enterprisedb(dot)com>>
> wrote:
> > [wip optimizations]
>
> > The pathological case (monotonic-asc) is now 4x faster, roughly equal to
> > regular minmax index build. The opposite (monotonic-desc) is about 33%
> > faster, roughly in line with btree.
>
> Those numbers look good. I get similar results, shown below. I've read
> 0007-8 briefly but not in depth.
>
> > There are a couple cases where it's actually a bit slower - those are
> > the cases with very few distinct values per range. I believe this
> > happens because in the batch mode the code does not check if the
> > summary already contains the value - it just adds it to the buffer -
> > and the last step ends up being more expensive because of that.
>
> I think if it's worst case a bit faster than btree and best case a bit
> slower than traditional minmax, that's acceptable.
>
> > I believe there's some "compromise" between those two extremes, i.e. we
> > should use a buffer that is neither too small nor too large, but something in
> > between, so that the reduction happens once in a while but not too often
> > (as with the original aggressive approach).
>
> This sounds good also.
>

Yeah, I agree.

I think the reason why some of the cases got a bit slower is that with
the original approach (ranges being built fairly frequently, not just
once at the end) we quickly built a summary that represented the whole
range, so adding a new value was often a no-op - the add_value callback
found that the range already included the new value, etc.

With the batch mode that's no longer true - we accumulate everything, so
we have to sort the whole buffer etc., which may be fairly expensive
thanks to calling the comparator functions. I wonder if this could be
optimized a bit, e.g. by first "deduplicating" the values using
memcmp() or so.
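
Something like this, perhaps (a minimal sketch of the memcmp() idea for
fixed-length values, not the actual patch code - all the names here are
made up):

    #include <stdlib.h>
    #include <string.h>

    static size_t val_len;      /* fixed length of each value, in bytes */

    /* cheap byte-wise comparator, no type-specific logic */
    static int
    bytewise_cmp(const void *a, const void *b)
    {
        return memcmp(a, b, val_len);
    }

    /* returns the number of values left after dropping exact duplicates */
    static size_t
    dedup_values(char *values, size_t nvalues, size_t len)
    {
        size_t  i, n = 1;

        if (nvalues < 2)
            return nvalues;

        val_len = len;
        qsort(values, nvalues, len, bytewise_cmp);

        for (i = 1; i < nvalues; i++)
        {
            /* keep value i only if it differs from the last kept one */
            if (memcmp(values + i * len, values + (n - 1) * len, len) != 0)
            {
                memcpy(values + n * len, values + i * len, len);
                n++;
            }
        }
        return n;
    }

Byte-wise order is of course not the datatype's sort order in general,
so we'd still need the proper comparator-based sort afterwards - but on
a (hopefully much) smaller array.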

But ultimately, I think the right solution will be to limit the buffer
size to something like 10x the target summary size, and roll with that.
Typically, increasing the buffer size from e.g. 100B to 1000B brings a
much clearer improvement than increasing it from 1000B to 10000B, and
I'd bet this follows the same pattern of diminishing returns.
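
In pseudo-code, the add_value path would do something like this (the
struct, BUFFER_FACTOR and the function names are illustrative, not what
the patch actually uses):

    #include <stddef.h>

    typedef struct SummaryBuffer
    {
        int    *values;         /* accumulated, unsorted values */
        size_t  nvalues;        /* number of values buffered so far */
        size_t  target;         /* target summary size (maxvalues) */
    } SummaryBuffer;

    #define BUFFER_FACTOR 10    /* let the buffer grow to 10x the target */

    static void
    sort_and_combine(SummaryBuffer *buf)
    {
        /* sort buf->values, merge the closest values/ranges, then: */
        buf->nvalues = buf->target;     /* stand-in for the real logic */
    }

    static void
    add_value_batched(SummaryBuffer *buf, int value)
    {
        buf->values[buf->nvalues++] = value;

        /* reduce once in a while, not on every insert */
        if (buf->nvalues >= BUFFER_FACTOR * buf->target)
            sort_and_combine(buf);
    }

That way the expensive reduction runs once per ~9x target inserts,
instead of on every insert (original approach) or just once on a huge
buffer at the very end (current batch mode).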

> > FWIW, none of this is likely to be an issue in practice, because (a)
> > tables usually don't have such strictly monotonic patterns, (b) people
> > should stick to plain minmax for cases that do.
>
> Still, it would be great if multi-minmax could be a drop-in replacement. I
> know there was a sticking point of a distance function not being
> available on all types, but I wonder if that can be remedied or worked
> around somehow.
>

Hmm. I think Alvaro also mentioned he'd like to use this as a drop-in
replacement for minmax (essentially, using these opclasses as the
default ones, with the option to switch back to plain minmax). I'm not
convinced we should do that, though. Imagine you have minmax indexes in
your existing DB, they're working perfectly fine, and then we come and
silently change that during dump/restore. Is there some past example
where we did something similar and it turned out to be OK?

As for the distance functions, I'm pretty sure there are data types
without "natural" distance - like most strings, for example. We could
probably invent something, but the question is how much we can rely on
it working well enough in practice.

Of course, is minmax even the right index type for such data types?
Strings are usually "labels" and not queried using range queries,
although sometimes people encode stuff as strings (but then it's very
unlikely we'd define the distance well). So maybe for those types a
hash / bloom opclass would be a better fit anyway.

But I do have an idea - maybe we can do without distances in those
cases. Essentially, the primary issue of minmax indexes is outliers, so
what if we simply sort the values, keep one range in the middle, and
keep a number of single points on each tail?

Imagine we have N values and want to represent them by K values. We
simply sort the N values, keep (K-2)/2 values on each tail as outliers,
and use the remaining 2 values as a range covering everything in between.

Example:

input: [1, 2, 100, 110, 111, ..., 120, ..., 130, 201, 202]

Given K = 6, we would keep 2 values on each tail, and a range for the rest:

[1, 2, (100, 130), 201, 202]
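
A self-contained sketch of that, with plain integers (the names and the
hard-coded K are just for illustration):

    #include <stdio.h>
    #include <stdlib.h>

    static int
    cmp_int(const void *a, const void *b)
    {
        int     x = *(const int *) a;
        int     y = *(const int *) b;

        return (x > y) - (x < y);
    }

    int
    main(void)
    {
        int     values[] = {201, 1, 110, 120, 2, 100, 130, 202, 111};
        int     n = sizeof(values) / sizeof(values[0]);
        int     k = 6;              /* target summary size */
        int     tail = (k - 2) / 2; /* outliers kept on each tail */
        int     i;

        qsort(values, n, sizeof(int), cmp_int);

        for (i = 0; i < tail; i++)      /* left-tail outliers */
            printf("%d ", values[i]);

        /* everything in between, collapsed into a single range */
        printf("(%d, %d) ", values[tail], values[n - tail - 1]);

        for (i = n - tail; i < n; i++)  /* right-tail outliers */
            printf("%d ", values[i]);

        printf("\n");   /* prints: 1 2 (100, 130) 201 202 */

        return 0;
    }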

Of course, this does not optimize for the same thing as when we have
distance - in that case we try to minimize the "covering" of the input
data, something like

sum(length(r) for r in ranges) / (max(ranges) - min(ranges))

But maybe it's good enough when there's no distance function ...
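
(To illustrate: if we treat the single points as zero-length ranges, the
example above scores (130 - 100) / (202 - 1) = 30/201 ~= 0.15, while a
plain minmax summary covering [1, 202] scores 1.0 by definition.)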

> > And (c) regular tables
> > tend to have much wider rows, so there are fewer values per range (so
> > that other stuff is likely more expensive than building BRIN).
>
> True. I'm still puzzled that it didn't help to use 8 pages per range,
> but it's moot now.
>

I'd bet that even with just 8 pages there were quite a few values in the
range - possibly hundreds per page. I haven't checked whether the
patches help with smaller ranges, so maybe we should try that.

>
> Here are some numbers (median of 3) with a similar scenario as before,
> repeated here with some added details. I didn't bother with what you
> call "unpatched":
>
>                btree   minmax   multi
> monotonic-asc  44.4    26.5     27.8
> mono-del-ins   38.7    24.6     30.4
> mono-10-asc    61.8    25.6     33.5
>

Thanks. Those numbers seem reasonable.

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
