Re: Parallel CREATE INDEX for BRIN indexes

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel CREATE INDEX for BRIN indexes
Date: 2023-11-29 20:56:18
Message-ID: c990c65c-93e9-f9c3-6792-02f929ce308f@enterprisedb.com
Lists: pgsql-hackers

On 11/29/23 21:30, Matthias van de Meent wrote:
> On Wed, 29 Nov 2023 at 18:55, Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> On 11/29/23 15:52, Tomas Vondra wrote:
>>>> ...
>>>>
>>>> This also made me think a bit more about how we're working with the
>>>> tuples. With your latest patch, we always deserialize and re-serialize
>>>> the sorted brin tuples, just in case the next tuple will also be a
>>>> BRIN tuple of the same page range. Could we save some of that
>>>> deserialization time by optimistically expecting that we're not going
>>>> to need to merge the tuple, and only keeping a local copy of it?
>>>> See attached 0002; this saves some cycles in common cases.
>>>>
>>>
>>> Good idea!
>>>
>>
>> FWIW there's a bug, in this part of the optimization:
>>
>> ------------------
>> + if (memtuple == NULL)
>> + memtuple = brin_deform_tuple(state->bs_bdesc, btup,
>> + memtup_holder);
>> +
>> union_tuples(state->bs_bdesc, memtuple, btup);
>> continue;
>> ------------------
>>
>> The deforming should use prevbtup, otherwise union_tuples() just
>> combines two copies of the same tuple.
>
> Good point. There were some more issues as well, fixes are attached.
>
>> Which however brings me to the bigger issue with this - my stress test
>> found this issue pretty quickly, but then I spent quite a bit of time
>> trying to find what went wrong. I find this reworked code pretty hard to
>> understand, and not necessarily because of how it's written. The problem
>> is that the same loop tries to juggle multiple pieces of information with
>> different lifespans, and so on. I find it really hard to reason about
>> how it behaves ...
>
> Yeah, it'd be nice if we had a peek option for sortsupport, that'd
> improve context handling.
>
>> I did try to measure how much it actually saves, but none of the tests I
>> did actually found measurable improvement. So I'm tempted to just not
>> include this part, and accept that we may deserialize some of the tuples
>> unnecessarily.
>>
>> Did you actually observe measurable improvements in some cases?
>
> The improvements would mostly stem from brin indexes with multiple
> (potentially compressed) by-ref types, as they go through more complex
> and expensive code to deserialize, requiring separate palloc() and
> memcpy() calls each.
> For single-column and by-value types the improvements are expected to
> be negligible, because there is no meaningful difference between
> copying a single by-ref value and copying its container; the
> additional work done for each tuple is marginal for those.
>
> For an 8-column BRIN index ((sha256((id)::text::bytea)::text),
> (sha256((id+1)::text::bytea)::text),
> (sha256((id+2)::text::bytea)::text), ...) instrumented with 0003 I
> measured a difference of 10x less time spent in the main loop of
> _brin_end_parallel, from ~30ms to 3ms when dealing with 55k 1-block
> ranges. It's not a lot, but worth at least something, I guess?
>

It is something, but I can't really convince myself it's worth the extra
code complexity. It's a somewhat extreme example, and the parallelism
certainly saves much more than this.

> The attached patch fixes the issue that you called out.
> It also further updates _brin_end_parallel: the final 'write empty
> tuples' loop is never hit and is thus removed, because if there were
> any tuples in the spool we'd have filled the empty ranges at the end
> of the main loop, and if there were no tuples in the spool then the
> memtuple would still be at its original initialized value of 0 thus
> resulting in a constant false condition. I also updated some comments.
>

Ah, right. I'll take a look tomorrow, but I guess I didn't realize we
insert the empty ranges in the main loop, because we're already looking
at the *next* summary.

But I think the idea was to insert empty ranges if there's a chunk of
empty ranges at the end of the table, after the last tuple the index
build reads. But I'm not sure that can actually happen ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
