Re: Parallel CREATE INDEX for BRIN indexes

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel CREATE INDEX for BRIN indexes
Date: 2023-11-22 19:16:55
Message-ID: fff0b3f5-d5a7-fcef-25bb-ac0427188008@enterprisedb.com
Lists: pgsql-hackers

On 11/20/23 20:48, Matthias van de Meent wrote:
> On Wed, 8 Nov 2023 at 12:03, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> Hi,
>>
>> here's an updated patch, addressing the review comments, and reworking
>> how the work is divided between the workers & leader etc.
>>
>> 0001 is just v2, rebased to current master
>>
>> 0002 and 0003 address most of the issues, in particular it
>>
>> - removes the unnecessary spool
>> - fixes bs_reltuples type to double
>> - a couple comments are reworded to be clearer
>> - changes loop/condition in brinbuildCallbackParallel
>> - removes asserts added for debugging
>> - fixes cast in comparetup_index_brin
>> - 0003 then simplifies comparetup_index_brin
>> - I haven't inlined the tuplesort_puttuple_common parameter
>> (didn't seem worth it)
>
> OK, thanks
>
>> 0004 Reworks how the work is divided between workers and combined by the
>> leader. It undoes the tableam.c changes that attempted to divide the
>> relation into chunks matching the BRIN ranges, and instead merges the
>> results in the leader (using the BRIN "union" function).
>
> That's OK.
>
>> I haven't done any indentation fixes yet.
>>
>> I did fairly extensive testing, using pageinspect to compare indexes
>> built with/without parallelism. More testing is needed, but it seems to
>> work fine (with other opclasses and so on).
>
> After code-only review, here are some comments:
>
>> +++ b/src/backend/access/brin/brin.c
>> [...]
>> +/* Magic numbers for parallel state sharing */
>> +#define PARALLEL_KEY_BRIN_SHARED UINT64CONST(0xA000000000000001)
>> +#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xA000000000000002)
>
> These shm keys use the same constants also in use in
> access/nbtree/nbtsort.c. While this shouldn't be an issue in normal
> operations, I'd prefer if we didn't actively introduce conflicting
> identifiers when we still have significant amounts of unused values
> remaining.
>

Hmmm. Is there some rule of thumb for how to pick these key values? I
see nbtsort.c uses the 0xA prefix, execParallel.c uses 0xE, while
parallel.c ended up using 0xFFFFFFFFFFFF as the prefix. I've used 0xB,
simply because BRIN also starts with B.

>> +#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xA000000000000003)
>
> This is the fourth definition of a PARALLEL%_KEY_QUERY_TEXT, the
> others being in access/nbtree/nbtsort.c (value 0xA000000000000004, one
> more than brin's), backend/executor/execParallel.c
> (0xE000000000000008), and PARALLEL_VACUUM_KEY_QUERY_TEXT (0x3) (though
> I've not checked that their uses are exactly the same, I'd expect at
> least btree to match mostly, if not fully, 1:1).
> I think we could probably benefit from a less ad-hoc sharing of query
> texts. I don't think that needs to happen specifically in this patch,
> but I think it's something to keep in mind in future efforts.
>

I'm afraid I don't quite get what you mean by "ad hoc sharing of query
texts". Are you saying we shouldn't propagate the query text to the
parallel workers? Why? Or what's the proper solution?

>> +_brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
>> [...]
>> + BrinSpool *spool = state->bs_spool;
>> [...]
>> + if (!state)
>> + return;
>
> I think the assignment to spool should be moved to below this
> condition, as _brin_begin_parallel calls this with state=NULL when it
> can't launch parallel workers, which will cause issues here.
>

Good catch! I wonder if we have tests that might trigger this, say by
setting max_parallel_maintenance_workers > 0 while no workers are
allowed to launch.

>> + state->bs_numtuples = brinshared->indtuples;
>
> My IDE complains about bs_numtuples being an integer. This is a
> pre-existing issue, but still valid: we can hit an overflow on tables
> with pages_per_range=1 and relsize >= 2^31 pages. Extremely unlikely,
> but problematic nonetheless.
>

True. I think I've been hesitant to make this a double because it seems
a bit weird to do +1 with a double, and at some point (d == d+1). But
this seems safe; we're guaranteed to stay far below that threshold.

>> + * FIXME This probably needs some memory management fixes - we're reading
>> + * tuples from the tuplesort, we're allocating an emty tuple, and so on.
>> + * Probably better to release this memory.
>
> This should probably be resolved.
>

AFAICS that comment is actually inaccurate/stale, sorry about that. The
code actually allocates (and resets) a single memtuple, and also
emptyTuple. So I think that's OK, I've removed the comment.

> I also noticed that this is likely to execute `union_tuples` many
> times when pages_per_range is coprime with the parallel table scan's
> block stride (or when we for other reasons have many tuples returned
> for each range); and this `union_tuples` internally allocates and
> deletes its own memory context for its deserialization of the 'b'
> tuple. I think we should just pass a scratch context instead, so that
> we don't have the overhead of continously creating then deleting the
> same memory context

Perhaps. Looking at the code, isn't it a bit strange how union_tuples
uses the context? It creates the context, calls brin_deform_tuple in
that context, but then the rest of the function (including datumCopy and
similar stuff) happens in the caller's context ...

However, I don't think the number of union_tuples calls is likely to be
very high, especially for large tables. Because we split the table into
2048 chunks and then cap the chunk size at 8192 blocks, for large
tables (where this matters) the chunk size is likely close to 8192.

>
>> +++ b/src/backend/catalog/index.c
>> [...]
>> - indexRelation->rd_rel->relam == BTREE_AM_OID)
>> + (indexRelation->rd_rel->relam == BTREE_AM_OID ||
>> + indexRelation->rd_rel->relam == BRIN_AM_OID))
>
> I think this needs some more effort. I imagine a new
> IndexAmRoutine->amcanbuildparallel is more appropriate than this
> hard-coded list - external indexes may want to utilize the parallel
> index creation planner facility, too.
>

Good idea. I added the IndexAmRoutine flag and used it here.

>
> Some notes:
> As a project PostgreSQL seems to be trying to move away from
> hardcoding heap into everything in favour of the more AM-agnostic
> 'table'. I suggest replacing all mentions of "heap" in the arguments
> with "table", to reduce the work future maintainers need to do to fix
> this. Even when this AM is mostly targetted towards the heap AM, other
> AMs (such as those I've heard of that were developed internally at
> EDB) use the same block-addressing that heap does, and should thus be
> compatible with BRIN. Thus, "heap" is not a useful name here.
>

I'm not against doing that, but I'd prefer to do it in a separate
patch. There's a bunch of preexisting heap references, and I don't
want to introduce inconsistency (patch using table, old code using
heap), nor do I want to tweak unrelated code.

> There are 2 new mentions of "tuplestore" in the patch, while the
> structure used is tuplesort: one on form_and_spill_tuple, and one on
> brinbuildCallbackParallel. Please update those comments.
>
> That's it for code review. I'll do some performance comparisons and
> testing soon, too.
>

Thanks! Attached is a patch squashing the previous version into a single
v3 commit, with fixes for your review in a separate commit.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
v4-0001-parallel-CREATE-INDEX-for-BRIN-v3.patch text/x-patch 44.3 KB
v4-0002-review-fixes.patch text/x-patch 9.1 KB
