From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel CREATE INDEX for BRIN indexes
Date: 2023-12-04 15:33:09
Message-ID: 32e3f645-389d-87f9-2ce7-33d0087f47d4@enterprisedb.com
Lists: pgsql-hackers

On 12/4/23 16:00, Matthias van de Meent wrote:
> On Sun, 3 Dec 2023 at 17:46, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>> On 11/30/23 18:47, Matthias van de Meent wrote:
>>> ...
>>>
>>> I just ran some more tests in less favorable environments, and it
>>> looks like I hit a bug:
>>>
>>> % SET max_parallel_workers = 0;
>>> % CREATE INDEX ... USING brin (...);
>>> ERROR: cannot update tuples during a parallel operation
>>>
>>> Fix attached in 0002.
>>
>> Yeah, that's a bug, thanks for the fix. Just jumping to a "cleanup"
>> label seems a bit cleaner (if that can be said about using goto), so I
>> tweaked the patch to do that instead.
>
> Good point, I agree that's cleaner.
>
>>> In 0003 I add the mentioned backfilling of empty ranges at the end of
>>> the table. I added it for both normal and parallel index builds, as
>>> normal builds apparently didn't have this yet.
>>>
>>
>> Right. I was thinking about doing that too, but you beat me to it. I
>> don't want to bury this in the main patch adding parallel builds, as it's
>> not really related to parallel CREATE INDEX. And it'd be weird to have
>> this for parallel builds first, so I rebased it as 0001.
>
> OK.
>
>> As for the backfilling, I think we need to simplify the code a bit.
>>
>> So 0004 simplifies this - the backfilling is done by a single function
>> called from all three places. The main complexity is in ensuring all
>> three places have the same concept of how to specify the range (of
>> ranges) to fill.
>
> Good points, +1. However, the simplification in 0005 breaks that with
> an underflow:
>
>> @@ -1669,6 +1672,19 @@ initialize_brin_buildstate(Relation idxRel, BrinRevmap *revmap,
>> state->bs_worker_id = 0;
>> state->bs_spool = NULL;
>>
>> + /*
>> + * Calculate the start of the last page range. Page numbers are 0-based,
>> + * so to get the index of the last page we need to subtract one. Then the
>> + * integer division gives us the proper 0-based range index.
>> + */
>> + state->bs_maxRangeStart = ((tablePages - 1) / pagesPerRange) * pagesPerRange;
>
> When the table is empty, this will try to fill all potential ranges up
> to InvalidBlockNo's range, which is obviously invalid. It also breaks
> the regression tests, as shown in CFBot.
>

Whoooops! You're right, of course - with an empty table (tablePages - 1)
wraps around. If it's empty, we should use 0 instead. That's what we do
now anyway - BRIN has the first range even for empty tables.
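
Something like this should do (untested, just to spell out the special
case - variable names as in the hunk quoted above):

    if (tablePages > 0)
    {
        /* 0-based index of the last page, rounded down to a range boundary */
        state->bs_maxRangeStart =
            ((tablePages - 1) / pagesPerRange) * pagesPerRange;
    }
    else
    {
        /* empty table - just the first range, starting at block 0 */
        state->bs_maxRangeStart = 0;
    }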

>> skipping the last page range?
>> -----------------------------
>>
>> I noticed you explicitly skipped backfilling the empty tuple for the last
>> page range. Can you explain? I suspect the idea was that user activity
>> would trigger building the tuple once that page range is filled, but we
>> don't really know if the table receives any changes. It might easily be
>> just a static table, in which case the last range would remain
>> unsummarized. If this is the right thing to do, the serial build should
>> probably do that too ...
>>
>> But I don't think that's the correct thing to do - I think CREATE INDEX
>> is expected to always build a complete index, so my version always
>> builds an index for all table pages.
>
> Hmm. My idea here is to create an index that is closer to what you get
> when you hit the insertion path with aminsert. This isn't 1:1 how the
> index builds ranges during (re)index when there is data for that
> range, but I thought it a close enough analog. Either way, I don't
> mind adding an empty range for the last range if that's considered
> useful.
>

I understand, but I'm not sure if keeping this consistency with aminsert
has any material benefit. It's not like we do that now, I think (for
empty tables we already build the first index range).
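
To make it concrete, by "all table pages" I mean the backfill loop runs
all the way up to bs_maxRangeStart, not stopping one range short. Roughly
like this (simplified sketch, not the actual patch - lastSummarizedStart
is a made-up variable, and the real code reuses a cached empty tuple
instead of forming one per range):

    BrinMemTuple *dtup = brin_new_memtuple(state->bs_bdesc);   /* all-empty summary */
    BlockNumber   blkno;

    /* fill every range after the last summarized one, including the last one */
    for (blkno = lastSummarizedStart + state->bs_pagesPerRange;
         blkno <= state->bs_maxRangeStart;
         blkno += state->bs_pagesPerRange)
    {
        Size        size;
        BrinTuple  *tup = brin_form_tuple(state->bs_bdesc, blkno, dtup, &size);

        brin_doinsert(state->bs_irel, state->bs_pagesPerRange, state->bs_rmAccess,
                      &state->bs_currentInsertBuf, blkno, tup, size);
        pfree(tup);
    }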

>> BlockNumber overflows
>> ---------------------
>>
>> The one thing I'm not quite sure about is whether this handles
>> overflows/underflows correctly. I mean, imagine you have a huge table
>> that's almost 0xFFFFFFFF blocks, pages_per_range is prime, and the last
>> range ends less than pages_per_range from 0xFFFFFFFF. Then this
>>
>> blkno += pages_per_range;
>>
>> can overflow, and might start inserting index tuples again (so we'd end
>> up with a duplicate).
>>
>> I do think the current patch does this correctly, but AFAICS this is a
>> pre-existing issue ...
>
> Yes, I know I've flagged this at least once before. IIRC, the response
> back then was that it's a very unlikely issue, as you'd have to extend
> the relation to at least the first block of the last range, which
> would currently be InvalidBlockNo - 131072 + 1, or just shy of 32TB of
> data at 8kB BLCKSZ. That's not exactly a common use case, and BRIN
> range ID wraparound is likely the least of your worries at that point.
>

Probably true, but it seems somewhat careless and untidy ...
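
If we ever decide to be tidy about it, a simple guard before the
increment would be enough, something like this (hypothetical, but
MaxBlockNumber is 0xFFFFFFFE so the subtraction itself is safe):

    /* stop before blkno + pages_per_range can wrap past the last valid block */
    if (blkno > MaxBlockNumber - pages_per_range)
        break;

    blkno += pages_per_range;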

>> Anyway, while working on this / stress-testing it, I realized there's a
>> bug in how we allocate the emptyTuple. It's allocated lazily, but that
>> can easily happen in the per-range context we introduced last week. It
>> needs to be allocated in the context covering the whole index build.
>
> Yeah, I hadn't tested with (very) sparse datasets yet.
>

I haven't actually checked what the failing cases look like, but I don't
think the data needs to be particularly sparse. AFAIK it's just that the
script deletes a chunk of the data somewhere in the table and/or creates
a partial index.

>> I think the best way to do that is per 0006, i.e. allocate it in the
>> BrinBuildState, along with the appropriate memory context.
>
> That fix looks fine to me.
>

Thanks!
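
For the archives, the gist of 0006 is just doing that one allocation in
the long-lived context, roughly like this (simplified sketch, field names
approximate, blkno being the start of the range we're about to fill):

    if (state->bs_emptyTuple == NULL)
    {
        /*
         * Build the cached "empty" tuple in the context that lives for the
         * whole index build, not in the per-range context that gets reset
         * after every range.
         */
        MemoryContext oldcxt = MemoryContextSwitchTo(state->bs_context);
        BrinMemTuple *dtup = brin_new_memtuple(state->bs_bdesc);

        state->bs_emptyTuple = brin_form_tuple(state->bs_bdesc, blkno, dtup,
                                               &state->bs_emptyTupleLen);

        MemoryContextSwitchTo(oldcxt);
    }
    else
    {
        /* already built - just point it at the right range */
        state->bs_emptyTuple->bt_blkno = blkno;
    }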

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
