Re: PATCH: Using BRIN indexes for sorted output

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: PATCH: Using BRIN indexes for sorted output
Date: 2022-10-16 14:01:33
Message-ID: CALNJ-vRc+2SosPz9EehuF1xoQT4s4YE3SEuKyFjdAwCSfq6sGA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:

> On 10/15/22 14:33, Tomas Vondra wrote:
> > Hi,
> >
> > ...
> >
> > There's a bunch of issues with this initial version of the patch,
> > usually described in XXX comments in the relevant places.
> >
> > ...
>
> I forgot to mention one important issue in my list yesterday, and that's
> memory consumption. The way the patch is coded now, the new BRIN support
> function (brin_minmax_ranges) produces information about *all* ranges in
> one go, which may be an issue. The worst case is a 32TB table with 1-page
> BRIN ranges, which means ~4 billion ranges. The info is an array of ~32B
> structs, so this would require ~128GB of RAM. With the default 128-page
> ranges, it'd still be ~1GB, which is quite a lot.
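
[For concreteness, the arithmetic above can be checked with a quick
standalone calculation; the ~32B per-range figure comes from the mail,
not from measuring the patch's actual BrinRange struct:]

    #include <stdio.h>

    int main(void)
    {
        const double table_bytes  = 32.0 * 1024 * 1024 * 1024 * 1024; /* 32TB */
        const double block_bytes  = 8192;   /* default BLCKSZ */
        const double struct_bytes = 32;     /* approx. per-range info */

        double pages = table_bytes / block_bytes;  /* ~4.3 billion pages */

        /* 1-page BRIN ranges: one entry per page */
        printf("1-page ranges:   %.0f ranges, %.0f GB\n",
               pages, pages * struct_bytes / (1024.0 * 1024 * 1024));

        /* default 128-page ranges */
        printf("128-page ranges: %.0f ranges, %.0f GB\n",
               pages / 128, pages / 128 * struct_bytes / (1024.0 * 1024 * 1024));

        return 0;
    }
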
>
> We could have a discussion about what a reasonable BRIN range size is on
> such large tables (e.g. building a bitmap on 4 billion ranges is going to
> be "not cheap", so this case is likely pretty rare). But we should
> not introduce new nodes that ignore work_mem, so we need a way to deal
> with such cases somehow.
>
> The easiest solution is likely to check this while planning - we can
> check the table size, calculate the number of BRIN ranges, check that
> the range info fits into work_mem, and simply not create the path when
> it would be too large. That's what we did for HashAgg, although that
> decision was unreliable because estimating GROUP BY cardinality is hard.
>
> The wrinkle here is that counting just the range info (BrinRange struct)
> does not include the values for by-reference types. We could use average
> width - that's just an estimate, though.
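
[A minimal sketch of that planning-time check; all names here are
hypothetical, and in the real planner this would consult the relation
size and the index's pages_per_range rather than take them as arguments:]

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Decide whether the per-range info for a BrinSort path fits into
     * work_mem. range_info_bytes approximates sizeof(BrinRange) plus the
     * estimated average width of by-reference min/max values.
     */
    static bool
    brinsort_fits_work_mem(uint64_t table_pages,
                           uint32_t pages_per_range,
                           uint64_t work_mem_kb,
                           uint32_t range_info_bytes)
    {
        uint64_t nranges = (table_pages + pages_per_range - 1) / pages_per_range;

        /* if too large, simply don't generate the BrinSort path */
        return nranges * range_info_bytes <= work_mem_kb * 1024;
    }
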
>
> A more comprehensive solution seems to be to allow requesting chunks of
> the BRIN ranges. So that we'd get "slices" of ranges and we'd process
> those. So for example if you have 1000 ranges, and you can only handle
> 100 at a time, we'd do 10 loops, each requesting 100 ranges.
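
[Expressed as a loop - brin_minmax_get_slice is an invented name standing
in for a chunked version of the support function, and the surrounding
variables are assumed:]

    /* hypothetical: fetch range summaries one work_mem-sized slice at a time */
    for (uint64_t start = 0; start < nranges; start += slice_size)
    {
        uint64_t n = (nranges - start < slice_size) ? (nranges - start)
                                                    : slice_size;

        /* next batch of ranges, sorted by maxval */
        BrinRange *slice = brin_minmax_get_slice(idxrel, start, n);

        process_slice(slice, n);
    }
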
>
> This has another problem - we do care about "overlaps", and we can't
> easily know whether the overlapping ranges will end up in the same
> "slice". The chunks would be sorted (for example) by maxval. But there
> can be a range with much higher maxval (thus in some future slice), but
> very low minval (thus intersecting with ranges in the current slice).
>
> Imagine ranges with these minval/maxval values, sorted by maxval:
>
> [101,200]
> [201,300]
> [301,400]
> [150,500]
>
> and let's say we can only process 2-range slices. So we'll get the first
> two, but both of them intersect with the very last range.
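
[The intersection test at play is the usual interval-overlap check; a
standalone illustration with those four ranges - note that [150,500] in
fact overlaps all three of the others, even though it sorts last by
maxval:]

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { int minval, maxval; } Range;

    /* two minmax ranges overlap iff neither ends before the other begins */
    static bool
    ranges_overlap(Range a, Range b)
    {
        return a.minval <= b.maxval && b.minval <= a.maxval;
    }

    int main(void)
    {
        Range r[] = {{101, 200}, {201, 300}, {301, 400}, {150, 500}};

        for (int i = 0; i < 3; i++)
            printf("[%d,%d] vs [150,500]: %s\n", r[i].minval, r[i].maxval,
                   ranges_overlap(r[i], r[3]) ? "overlap" : "disjoint");
        return 0;
    }
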
>
> We could always include all the intersecting ranges into the slice, but
> what if there are too many very "wide" ranges?
>
> So I think this will need to switch to iterative communication with
> the BRIN index - instead of asking "give me info about all the ranges",
> we'll need a way to
>
> - request the next range (sorted by maxval)
> - request the intersecting ranges one by one (sorted by minval)
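
[One possible shape for such an API - names hypothetical; the patch's
existing support function, brin_minmax_ranges, returns everything in one
go:]

    typedef struct BrinRangeScan BrinRangeScan;   /* opaque iterator state */

    /* next range in maxval order, or NULL when exhausted */
    BrinRange *brin_next_range(BrinRangeScan *scan);

    /*
     * ranges intersecting "range", one per call in minval order,
     * or NULL when there are no more
     */
    BrinRange *brin_next_intersecting_range(BrinRangeScan *scan,
                                            BrinRange *range);
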
>
> Of course, the BRIN side will have some of the same challenges with
> tracking the info without breaking the work_mem limit, but I suppose it
> can store the info in a tuplestore/tuplesort and use that instead of a
> plain in-memory array. Alternatively, it could just return the
> tuplestore/tuplesort and let BrinSort use it directly. OTOH it seems
> cleaner to have some sort of API,
> especially if we want to support e.g. minmax-multi opclasses, that have
> a more complicated concept of "intersection".
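
[The tuplesort variant might look roughly like this - a sketch against
the datum-tuplesort API as of PostgreSQL 15 (signatures differ slightly
in other releases), with all the surrounding variables assumed:]

    Tuplesortstate *state;
    Datum           value;
    bool            isnull;
    Datum           abbrev;

    /* sort range info by maxval, spilling to temp files beyond work_mem */
    state = tuplesort_begin_datum(maxval_type, sort_op, collation,
                                  false, work_mem, NULL, TUPLESORT_NONE);

    for (int i = 0; i < nranges; i++)
        tuplesort_putdatum(state, ranges[i].maxval, false);

    tuplesort_performsort(state);

    while (tuplesort_getdatum(state, true, &value, &isnull, &abbrev))
    {
        /* hand the next range (in maxval order) to BrinSort */
    }

    tuplesort_end(state);
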
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
Hi,
In your example involving [150,500], can this range be broken down into 4
ranges, ending at 200, 300, 400 and 500, respectively?
That way, there would be no intersection among the ranges.

bq. can store the info into a tuplestore/tuplesort

Wouldn't this involve disk accesses, which may reduce the effectiveness of
BRIN sort?

Cheers
