Re: Parallel CREATE INDEX for BRIN indexes

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel CREATE INDEX for BRIN indexes
Date: 2023-12-03 16:46:03
Message-ID: 99f9ecbe-0976-8012-67b5-dbff58c7dbec@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/30/23 18:47, Matthias van de Meent wrote:
> ...
>
> I just ran some more tests in less favorable environments, and it
> looks like I hit a bug:
>
> % SET max_parallel_workers = 0;
> % CREATE INDEX ... USING brin (...);
> ERROR: cannot update tuples during a parallel operation
>
> Fix attached in 0002.

Yeah, that's a bug, thanks for the fix. Yeah Just jumping to a "cleanup"
label seems a bit cleaner (if that can be said about using goto), so I
tweaked the patch to do that instead.

> In 0003 I add the mentioned backfilling of empty ranges at the end of
> the table. I added it for both normal and parallel index builds, as
> normal builds apparently also didn't yet have this yet.
>

Right. I was thinking about doing that to, but you beat me to it. I
don't want to bury this in the main patch adding parallel builds, it's
not really related to parallel CREATE INDEX. And it'd be weird to have
this for parallel builds first, so I rebased it as 0001.

As for the backfilling, I think we need to simplify the code a bit. We
have three places doing essentially the same thing (one for serial
builds, two for parallel builds). That's unnecessarily verbose, and
makes it harder to understand the code. But more importantly, the three
places are not doing exactly the same - some increment the current range
before, some do it at the end of the loop, etc. I got confused by this
multiple times.

So 0004 simplifies this - the backfilling is done by a function called
from all the places. The main complexity is in ensuring all three places
have the same concept of how to specify the range (of ranges) to fill.

Note: The serial might have two places too, but the main loop in
brinbuildCallback() does it range by range. It's a bit less efficient as
it can't use the pre-built empty tuple easily, but that's fine IMO.

skipping the last page range?
-----------------------------

I noticed you explicitly skipped backfilling empty tuple for the last
page range. Can you explain? I suspect the idea was that the user
activity would trigger building the tuple once that page range is
filled, but we don't really know if the table receives any changes. It
might easily be just a static table, in which case the last range would
remain unsummarized. If this is the right thing to do, the serial build
should do that too probably ...

But I don't think that's the correct thing to do - I think CREATE INDEX
is expected to always build a complete index, so my version always
builds an index for all table pages.

BlockNumber overflows
---------------------

The one thing that I'm not quite sure is correct is whether this handles
overflows/underflows correctly. I mean, imagine you have a huge table
that's almost 0xFFFFFFFF blocks, pages_per_range is prime, and the last
range ends less than pages_per_range from 0xFFFFFFFF. Then this

blkno += pages_per_range;

can overflow, and might start inserting index tuples again (so we'd end
up with a duplicate).

I do think the current patch does this correctly, but AFAICS this is a
pre-existing issue ...

Anyway, while working on this / stress-testing it, I realized there's a
bug in how we allocate the emptyTuple. It's allocated lazily, but if can
easily happen in the per-range context we introduced last week. It needs
to be allocated in the context covering the whole index build.

I think the best way to do that is per 0006, i.e. allocate it in the
BrinBuildState, along with the appropriate memory context.

Obviously, all of this (0002-0006) should be squashed into a single
commit, I only keep it separate to make it clearer what changed.

stress-testing script
---------------------

I'm also attaching the bash script I use to stress test this - it's just
a loop that creates somewhat random table (different number of rows,
distinct values, ...), maybe deletes some of it, creates an index
(possibly partial), and then does various checks on it (checks number of
ranges, queries the table, etc.). It's somewhat primitive but it turned
out to be very capable in triggering bugs in BlockNumber arithmetic,
emptyTuple allocations, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
v8-0001-backfill-pages-in-serial-build.patch text/x-patch 4.0 KB
v8-0002-Allow-BRIN-to-build-its-index-in-parallel.patch text/x-patch 53.0 KB
v8-0003-BRIN-Exit-parallel-mode-when-not-starting-paralle.patch text/x-patch 1.4 KB
v8-0004-BRIN-Backfill-empty-ranges.patch text/x-patch 5.2 KB
v8-0005-simplify-filling-empty-ranges.patch text/x-patch 8.7 KB
v8-0006-memcontext-emptyTuple.patch text/x-patch 6.1 KB
run.sh application/x-shellscript 6.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joe Conway 2023-12-03 17:11:34 Re: Emitting JSON to file using COPY TO
Previous Message Andrew Dunstan 2023-12-03 16:14:15 Re: More new SQL/JSON item methods