Re: Parallel CREATE INDEX for BRIN indexes

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel CREATE INDEX for BRIN indexes
Date: 2023-12-04 15:00:35
Message-ID: CAEze2WjmoEo9UFpsiq_1sngOnZArjUMjN8wyzcQvegLFDddYUQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, 3 Dec 2023 at 17:46, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> On 11/30/23 18:47, Matthias van de Meent wrote:
> > ...
> >
> > I just ran some more tests in less favorable environments, and it
> > looks like I hit a bug:
> >
> > % SET max_parallel_workers = 0;
> > % CREATE INDEX ... USING brin (...);
> > ERROR: cannot update tuples during a parallel operation
> >
> > Fix attached in 0002.
>
> Yeah, that's a bug, thanks for the fix. Yeah Just jumping to a "cleanup"
> label seems a bit cleaner (if that can be said about using goto), so I
> tweaked the patch to do that instead.

Good point, I agree that's cleaner.

> > In 0003 I add the mentioned backfilling of empty ranges at the end of
> > the table. I added it for both normal and parallel index builds, as
> > normal builds apparently also didn't yet have this yet.
> >
>
> Right. I was thinking about doing that to, but you beat me to it. I
> don't want to bury this in the main patch adding parallel builds, it's
> not really related to parallel CREATE INDEX. And it'd be weird to have
> this for parallel builds first, so I rebased it as 0001.

OK.

> As for the backfilling, I think we need to simplify the code a bit.
>
> So 0004 simplifies this - the backfilling is done by a function called
> from all the places. The main complexity is in ensuring all three places
> have the same concept of how to specify the range (of ranges) to fill.

Good points, +1. However, the simplification in 0005 breaks that with
an underflow:

> @@ -1669,6 +1672,19 @@ initialize_brin_buildstate(Relation idxRel, BrinRevmap *revmap,
> state->bs_worker_id = 0;
> state->bs_spool = NULL;
>
> + /*
> + * Calculate the start of the last page range. Page numbers are 0-based,
> + * so to get the index of the last page we need to subtract one. Then the
> + * integer division gives us the proper 0-based range index.
> + */
> + state->bs_maxRangeStart = ((tablePages - 1) / pagesPerRange) * pagesPerRange;

When the table is empty, this will try to fill all potential ranges up
to InvalidBlockNo's range, which is obviously invalid. It also breaks
the regression tests, as showin in CFBot.

> skipping the last page range?
> -----------------------------
>
> I noticed you explicitly skipped backfilling empty tuple for the last
> page range. Can you explain? I suspect the idea was that the user
> activity would trigger building the tuple once that page range is
> filled, but we don't really know if the table receives any changes. It
> might easily be just a static table, in which case the last range would
> remain unsummarized. If this is the right thing to do, the serial build
> should do that too probably ...
>
> But I don't think that's the correct thing to do - I think CREATE INDEX
> is expected to always build a complete index, so my version always
> builds an index for all table pages.

Hmm. My idea here is to create an index that is closer to what you get
when you hit the insertion path with aminsert. This isn't 1:1 how the
index builds ranges during (re)index when there is data for that
range, but I thought it to be a close enough analog. Either way, I
don't mind it adding an empty range for the last range if that's
considered useful.

> BlockNumber overflows
> ---------------------
>
> The one thing that I'm not quite sure is correct is whether this handles
> overflows/underflows correctly. I mean, imagine you have a huge table
> that's almost 0xFFFFFFFF blocks, pages_per_range is prime, and the last
> range ends less than pages_per_range from 0xFFFFFFFF. Then this
>
> blkno += pages_per_range;
>
> can overflow, and might start inserting index tuples again (so we'd end
> up with a duplicate).
>
> I do think the current patch does this correctly, but AFAICS this is a
> pre-existing issue ...

Yes, I know I've flagged this at least once before. IIRC, the response
back then was that it's a very unlikely issue, as you'd have to extend
the relation to at least the first block of the last range, which
would currently be InvalidBlockNo - 131072 + 1, or just shy of 32TB of
data at 8kB BLCKSZ. That's not exactly a common use case, and BRIN
range ID wraparound is likely the least of your worries at that point.

> Anyway, while working on this / stress-testing it, I realized there's a
> bug in how we allocate the emptyTuple. It's allocated lazily, but if can
> easily happen in the per-range context we introduced last week. It needs
> to be allocated in the context covering the whole index build.

Yeah, I hadn't tested with (very) sparse datasets yet.

> I think the best way to do that is per 0006, i.e. allocate it in the
> BrinBuildState, along with the appropriate memory context.

That fix looks fine to me.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2023-12-04 15:22:31 Re: Clean up some signal usage mainly related to Windows
Previous Message Alexander Lakhin 2023-12-04 15:00:00 Re: PATCH: Add REINDEX tag to event triggers