Re: PATCH: Using BRIN indexes for sorted output

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Greg Stark <stark(at)mit(dot)edu>, Zhihong Yu <zyu(at)yugabyte(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: PATCH: Using BRIN indexes for sorted output
Date: 2023-02-27 13:58:51
Message-ID: CAEze2WhUyFWc8P2Z30f9rqfo-NKTQzVK4NqTaj15QzeAu5y9LQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, 24 Feb 2023, 20:14 Tomas Vondra, <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> On 2/24/23 19:03, Matthias van de Meent wrote:
> > On Thu, 23 Feb 2023 at 19:48, Tomas Vondra
> >> Yeah, that sounds like a bug. Also a sign the tests should have some
> >> by-ref data types (presumably there are none, as that would certainly
> >> trip some asserts etc.).
> >
> > I'm not sure we currently trip asserts, as the data we store in the
> > memory context is very limited, making it unlikely we actually release
> > the memory region back to the OS.
> > I did get assertion failures by adding the attached patch, but I don't
> > think that's something we can do in the final release.
> >
>
> But we should randomize the memory if we ever do pfree(), and it's
> strange valgrind didn't complain when I ran tests with it.

Well, we don't clean up the decoding tuple immediately after our last
iteration, so the memory context (and the last tuple's attributes) are
still valid memory addresses. And, assuming the min/max values for all
BRIN ranges have the same max-aligned length, the attribute pointers
are likely to point to the same offset within the decoding tuple's
memory context's memory segment, which means the dangling pointers
still point to valid memory - just not to memory with the contents
they were expected to point to.
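To illustrate the failure mode, here's a minimal standalone sketch (not
the actual BRIN decoding code; the arena and all names are made up) of
how a stale pointer keeps reading "valid" memory after the per-range
context is reset and refilled with same-sized values:

/*
 * Standalone sketch of the scenario above.  "arena" stands in for the
 * decoding tuple's memory context; alignment and sizes are only
 * illustrative.
 */
#include <stdio.h>
#include <string.h>

#define ARENA_SIZE 1024

static char arena[ARENA_SIZE];
static size_t arena_used = 0;

static void *
arena_alloc(size_t len)
{
    void   *p = arena + arena_used;

    arena_used += (len + 7) & ~(size_t) 7;  /* MAXALIGN-style rounding */
    return p;
}

static void
arena_reset(void)
{
    /* does not wipe the old contents, so stale pointers read quietly */
    arena_used = 0;
}

int
main(void)
{
    /* "decode" the min value of range 1 and keep a pointer to it */
    char   *minval = arena_alloc(8);

    memcpy(minval, "range-1", 8);

    /*
     * Move on to range 2: reset and refill.  Because the new value has
     * the same max-aligned length, it lands at the same offset.
     */
    arena_reset();
    memcpy(arena_alloc(8), "range-2", 8);

    /* the stale pointer still reads valid memory - with the wrong contents */
    printf("stale pointer reads: %s\n", minval);
    return 0;
}

If the reset also wiped the old contents (the way CLOBBER_FREED_MEMORY
does for freed memory), the stale read would at least produce obviously
bogus data instead of silently picking up the next range's value.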

> >> Considering how tiny BRIN indexes are, this is likely orders of
> >> magnitude less I/O than we expend on sampling rows from the table. I
> >> mean, with the default statistics target we read ~30000 pages (~240MB)
> >> or more just to sample the rows. Randomly, while the BRIN index is
> >> likely scanned mostly sequentially.
> >
> > Mostly agreed; except I think it's not too difficult to imagine a BRIN
> > index on that scale - for example, with bloom filters that easily take
> > up several hundred bytes.
> >
> > With the default configuration of 128 pages_per_range,
> > n_distinct_per_range of -0.1, and false_positive_rate of 0.01, the
> > bloom filter size is 4.36 KiB - each indexed item on its own page. It
> > is still only 1% of the original table's size, but there are enough
> > tables that are larger than 24GB that this could be a significant
> > issue.
> >
>
> Right, it's certainly true BRIN indexes may be made fairly large (either
> by using something like bloom or by making the ranges much smaller). But
> those are exactly the indexes where building statistics for all columns
> at once is going to cause issues with memory usage etc.
>
> Note: Obviously, that depends on how much data per range we need to keep
> in memory. For bloom I doubt we'd actually want to keep all the filters,
> we'd probably calculate just some statistics (e.g. number of bits set),
> so maybe the memory consumption is not that bad.

Yes, I was thinking along the same lines for bloom as well. Something
like 'average number of bits set' (or a histogram of the number of bits
set), and/or, for each bit, a count (or percentage) of how many times
it is set, etc.
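As a very rough sketch of the kind of summary I have in mind (purely
illustrative - the filter count, size and representation here are made
up, not the actual brin_bloom layout):

#include <stdio.h>
#include <stdint.h>

#define NFILTERS     4      /* number of ranges sampled (illustrative) */
#define FILTER_BYTES 8      /* bloom filter size in bytes (illustrative) */

static int
popcount_byte(uint8_t b)
{
    int     n = 0;

    while (b)
    {
        n += b & 1;
        b >>= 1;
    }
    return n;
}

int
main(void)
{
    /* one (made-up) bloom filter per range */
    uint8_t filters[NFILTERS][FILTER_BYTES] = {
        {0xff, 0x01}, {0x0f, 0x10}, {0xf0, 0x00}, {0x03, 0x00},
    };
    int     bit_set_count[FILTER_BYTES * 8] = {0};
    long    total_bits_set = 0;

    for (int f = 0; f < NFILTERS; f++)
        for (int byte = 0; byte < FILTER_BYTES; byte++)
        {
            total_bits_set += popcount_byte(filters[f][byte]);
            for (int bit = 0; bit < 8; bit++)
                if (filters[f][byte] & (1 << bit))
                    bit_set_count[byte * 8 + bit]++;
        }

    /* 'average number of bits set' across the sampled filters */
    printf("avg bits set per filter: %.2f\n",
           (double) total_bits_set / NFILTERS);
    /* per-bit: in how many filters is bit 0 set? */
    printf("bit 0 set in %d of %d filters\n", bit_set_count[0], NFILTERS);
    return 0;
}

The per-bit counts are the kind of thing a histogram or a rough
estimate of filter saturation could then be derived from, without
keeping the filters themselves around.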

> >> Maybe there are cases where this would be an issue, but I haven't seen
> >> one when working on this patch (and I did a lot of experiments). I'd
> >> like to see one before we start optimizing it ...
> >
> > I'm not only worried about optimizing it, I'm also worried that we're
> > putting this abstraction at the wrong level in a way that is difficult
> > to modify.
> >
>
> Yeah, that's certainly a valid concern. I admit I only did the minimum
> amount of work on this part, as I was focused on the sorting part.
>
> >> This also reminds me that the issues I actually saw (e.g. memory
> >> consumption) would be made worse by processing all columns at once,
> >> because then you need to keep more columns in memory.
> >
> > Yes, I think that can be a valid concern, but don't we already do similar
> > things in the current table statistics gathering?
> >
>
> Not really, I think. We sample a bunch of rows once, but then we build
> statistics on this sample for each attribute / expression independently.
> We could of course read the whole index into memory and then run the
> analysis, but I think tuples tend to be much smaller (thanks to TOAST
> etc.) and we only really scan a limited amount of them (30k).

Just to note: with default settings, sampling 30k index entries from a
BRIN index would cover ~29 GiB of the table. This isn't a lot, but it
also isn't exactly a small table. I think it would be difficult to get
accurate avg_overlap statistics for some shapes of BRIN data...
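(For reference: 30,000 sampled ranges * 128 pages_per_range * 8 kB per
block ≈ 29.3 GiB of heap covered by those ranges.)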

> But if we're concerned about the I/O, the BRIN is likely fairly large,
> so maybe reading it into memory at once is not a great idea.

Agreed, we can't expect that the interesting parts of the BRIN index
will always fit in the available memory.

Kind regards,

Matthias van de Meent
