Re: PATCH: Using BRIN indexes for sorted output

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: PATCH: Using BRIN indexes for sorted output
Date: 2022-10-16 14:42:39
Message-ID: CALNJ-vSXvkBQTSVwoN=Q3ukW_UomEenPA-qyEXkoJTRU1r2prg@mail.gmail.com
Lists: pgsql-hackers

On Sun, Oct 16, 2022 at 7:33 AM Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:

>
>
> On 10/16/22 16:01, Zhihong Yu wrote:
> >
> >
> > On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra
> > <tomas(dot)vondra(at)enterprisedb(dot)com <mailto:tomas(dot)vondra(at)enterprisedb(dot)com>>
> > wrote:
> >
> > On 10/15/22 14:33, Tomas Vondra wrote:
> > > Hi,
> > >
> > > ...
> > >
> > > There's a bunch of issues with this initial version of the patch,
> > > usually described in XXX comments in the relevant places.
> > >
> > > ...
> >
> > I forgot to mention one important issue in my list yesterday, and
> > that's memory consumption. The way the patch is coded now, the new
> > BRIN support function (brin_minmax_ranges) produces information
> > about *all* ranges in one go, which may be an issue. The worst case
> > is a 32TB table with 1-page BRIN ranges, which means ~4 billion
> > ranges. The info is an array of ~32B structs, so this would require
> > ~128GB of RAM. With the default 128-page ranges it'd still be ~1GB,
> > which is quite a lot.
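> >
> > As a quick back-of-the-envelope check (assuming the default 8kB
> > pages), those numbers work out like this:
> >
> >     32TB / 8kB pages = ~4 billion pages
> >
> >     1-page ranges:   ~4e9  ranges * 32B = ~128GB
> >     128-page ranges: ~33e6 ranges * 32B = ~1GB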
> >
> > We could have a discussion about what's the reasonable size of BRIN
> > ranges on such large tables (e.g. building a bitmap on 4 billion
> > ranges is going to be "not cheap" so this is likely pretty rare).
> > But we should not introduce new nodes that ignore work_mem, so we
> > need a way to deal with such cases somehow.
> >
> > The easiest solution likely is to check this while planning - we
> > can check the table size, calculate the number of BRIN ranges,
> > check that the range info fits into work_mem, and just not create
> > the path when it gets too large. That's what we did for HashAgg,
> > although that decision was unreliable because estimating GROUP BY
> > cardinality is hard.
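> >
> > For illustration, a minimal sketch of such a planning check - the
> > function and its arguments are made up, not actual patch code:
> >
> >     #include <stdbool.h>
> >     #include <stdint.h>
> >
> >     /* Would the per-range info fit into work_mem? */
> >     static bool
> >     brin_range_info_fits(uint64_t relpages, uint64_t pages_per_range,
> >                          uint64_t range_struct_size,
> >                          uint64_t work_mem_bytes)
> >     {
> >         uint64_t nranges = (relpages + pages_per_range - 1) /
> >                            pages_per_range;
> >
> >         return nranges * range_struct_size <= work_mem_bytes;
> >     }
> >
> > and the planner would simply not create the BrinSort path when it
> > returns false.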
> >
> > The wrinkle here is that counting just the range info (BrinRange
> > struct) does not include the values for by-reference types. We
> > could use average width - that's just an estimate, though.
> >
> > A more comprehensive solution seems to be to allow requesting
> > chunks of the BRIN ranges, so that we'd get "slices" of ranges and
> > we'd process those. For example, if you have 1000 ranges and can
> > only handle 100 at a time, we'd do 10 loops, each requesting 100
> > ranges.
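> >
> > In code, the consuming loop might look roughly like this (the
> > slice function is hypothetical, just to show the shape):
> >
> >     #include <stdint.h>
> >
> >     typedef struct BrinRangeInfo BrinRangeInfo; /* minval/maxval */
> >
> >     /* hypothetical: fill 'buf' with up to 'n' ranges starting at
> >      * 'start' (in maxval order), returning how many were loaded */
> >     extern int brin_minmax_ranges_slice(void *idxinfo, uint64_t start,
> >                                         int n, BrinRangeInfo *buf);
> >
> >     static void
> >     process_in_slices(void *idxinfo, uint64_t nranges, int slice_size,
> >                       BrinRangeInfo *buf)
> >     {
> >         for (uint64_t start = 0; start < nranges; start += slice_size)
> >         {
> >             int nloaded = brin_minmax_ranges_slice(idxinfo, start,
> >                                                    slice_size, buf);
> >
> >             /* sort/output the tuples covered by this slice */
> >             (void) nloaded;
> >         }
> >     }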
> >
> > This has another problem - we do care about "overlaps", and we
> > can't really know if the overlapping ranges will be in the same
> > "slice" easily. The chunks would be sorted (for example) by maxval.
> > But there can be a range with much higher maxval (thus in some
> > future slice), but very low minval (thus intersecting with ranges
> > in the current slice).
> >
> > Imagine ranges with these minval/maxval values, sorted by maxval:
> >
> > [101,200]
> > [201,300]
> > [301,400]
> > [150,500]
> >
> > and let's say we can only process 2-range slices. So we'll get the
> > first two, but both of them intersect with the very last range.
> >
> > We could always include all the intersecting ranges into the
> > slice, but what if there are too many very "wide" ranges?
> >
> > So I think this will need to switch to an iterative communication
> > with the BRIN index - instead of asking "give me info about all
> > the ranges", we'll need a way to
> >
> > - request the next range (sorted by maxval)
> > - request the intersecting ranges one by one (sorted by minval)
> >
> > Of course, the BRIN side will have some of the same challenges
> > with tracking the info without breaking the work_mem limit, but I
> > suppose it can store the info into a tuplestore/tuplesort, and use
> > that instead of a plain in-memory array. Alternatively, it could
> > just return those, and BrinSort would use that. OTOH it seems
> > cleaner to have some sort of API, especially if we want to support
> > e.g. minmax-multi opclasses, which have a more complicated concept
> > of "intersection".
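> >
> > A hypothetical shape for that API (the names are invented here,
> > not actual patch code) might be:
> >
> >     typedef struct BrinRangeScan BrinRangeScan;   /* opaque state */
> >     typedef struct BrinRangeInfo BrinRangeInfo;   /* minval/maxval */
> >
> >     /* next range in maxval order; returns false when no more */
> >     bool brin_minmax_next_range(BrinRangeScan *scan,
> >                                 BrinRangeInfo *out);
> >
> >     /* ranges intersecting the current one (minval < its maxval),
> >      * one by one in minval order; false when exhausted */
> >     bool brin_minmax_next_intersecting(BrinRangeScan *scan,
> >                                        BrinRangeInfo *out);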
> >
> >
> > regards
> >
> > --
> > Tomas Vondra
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company
> >
> > Hi,
> > In your example involving [150,500], can this range be broken down
> > into 4 ranges, ending in 200, 300, 400 and 500, respectively?
> > That way, there is no intersection among the ranges.
> >
>
> Not really, I think. These "value ranges" map to "page ranges" and how
> would you split those? I mean, you know values [150,500] map to blocks
> [0,127]. You split the values into [150,200], [201,300], [301,400] and
> [401,500]. How do you split the page range [0,127]?
>
> Also, splitting a range into more ranges is likely making the issue
> worse, because it increases the number of ranges, right? And I mean,
> much worse, because imagine a "wide" range that overlaps with every
> other range - the number of ranges would explode.
>
> It's not clear to me at which point you'd make the split. At the
> beginning, right after loading the ranges from the BRIN index? A lot
> of that may be unnecessary, in case the range is loaded as a
> "non-intersecting" range.
>
> Try to formulate the whole algorithm. Maybe I'm missing something.
>
> The current algorithm is something like this:
>
> 1. request info about ranges from the BRIN opclass
> 2. sort them by maxval and minval
> 3. NULLS FIRST: read all ranges that might have NULLs => output
> 4. read the next range (by maxval) into tuplesort
> (if no more ranges, go to (9))
> 5. load all tuples from "splill" tuplestore, compare to maxval
> 6. load all tuples from not-summarized ranges (first range only)
> (into tuplesort/tuplestore, depending on maxval comparison)
> 7. load all intersecting ranges (with minval < current maxval)
> (into tuplesort/tuplestore, depending on maxval comparison)
> 8. sort the tuplesort, output all tuples, then back to (4)
> 9. NULLS LAST: read all ranges that might have NULLs => output
> 10. done
>
> For "DESC" ordering the process is almost the same, except that we swap
> minval/maxval in most places.
>
Hi,
Thanks for the quick reply.

I don't have a good answer w.r.t. splitting the page range [0,127]
right now. Let me think more about it.

The 10-step flow (subject to change down the road) should either be
given in the patch description or written as a comment inside the code.
That would help people grasp the concept much faster.
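
For instance, the comment (or the executor skeleton itself) could
summarize the main loop roughly like this - the helper names are made
up, just to show the shape:

    emit_null_ranges(state);             /* step 3, NULLS FIRST */
    while (next_range_by_maxval(state))  /* step 4 */
    {
        recheck_spilled_tuples(state);   /* step 5 */
        load_unsummarized_ranges(state); /* step 6, first range only */
        load_intersecting_ranges(state); /* step 7, minval < maxval */
        sort_and_output(state);          /* step 8 */
    }
    emit_null_ranges(state);             /* step 9, NULLS LAST */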

BTW, "splill" in step 5 seems to be a typo - I assume you meant "spill".

Cheers
