Re: WIP: parallel GiST index builds

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Kirill Reshke <reshkekirill(at)gmail(dot)com>
Cc: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Michael Paquier <michael(at)paquier(dot)xyz>, Andreas Karlsson <andreas(at)proxel(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WIP: parallel GiST index builds
Date: 2025-10-24 12:43:40
Message-ID: 84abeb00-c102-4bdc-a5c3-e8afa6455de3@vondra.me
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10/23/25 16:13, Kirill Reshke wrote:
> On Fri, 8 Nov 2024 at 19:53, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> On 10/31/24 19:05, Andrey M. Borodin wrote:
>>>
>>>
>>>> On 8 Oct 2024, at 17:05, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>>>
>>>> Here's an updated patch adding the queryid.
>>>
>>> I've took another round of looking through the patch.
>>> Everything I see seems correct to me. It just occurred to me that we
>>> will have: buffered build, parallel build, sorting build. All 3 aiming
>>> to speed things up. I really hope that we will find a way to parallel
>>> sorting build, because it will be fastest for sure.
>>>
>>
>> The number of different ways to build a GiST index worries me. When we
>> had just buffered vs. sorted builds, that was pretty easy decision,
>> because buffered builds are much faster 99% of the time.
>>
>> But for parallel builds it's not that clear - it can easily happen that
>> we use much more CPU, without speeding anything up. We just start as
>> many parallel workers as possible, given the maintenance_work_mem value,
>> and hope for the best.
>>
>> Maybe with parallel buffered builds it's be clearer ... but I'm not
>> sufficiently familiar with that code, and I don't have time to study
>> that at the moment because of other patches. Someone else would have to
>> take a stab at that ...
>>
>>>
>>> Currently we have some instances of such code...or similar... or
>>> related code.
>>>
>>> if (is_build)
>>> {
>>> if (is_parallel)
>>> recptr = GetFakeLSNForUnloggedRel();
>>> else
>>> recptr = GistBuildLSN;
>>> }
>>> else
>>> {
>>> if (RelationNeedsWAL(rel))
>>> {
>>> recptr = actuallyWriteWAL();
>>> }
>>> else
>>> recptr = gistGetFakeLSN(rel);
>>> }
>>> // Use recptr
>>>
>>> In previous review I was proponent of not adding arguments to
>>> gistGetFakeLSN(). But now I see that now logic of choosing LSN
>>> source is sprawling across various places...
>>>
>>> Now I do not have a strong point of view on this. Do you think if
>>> something like following would be clearer?
>>> if (!is_build && RelationNeedsWAL(rel))
>>> {
>>> recptr = actuallyWriteWAL();
>>> }
>>> else
>>> recptr = gistGetFakeLSN(rel, is_build, is_parallel);
>>>
>>> Just as an idea.
>>>
>>> I'm mostly looking on GiST-specific parts of the patch, while things
>>> around entering parallel mode seems much more complicated. But as far as
>>> I can see, large portions of this code are taken from B-tree\BRIN.
>>>
>>
>> I agree the repeated code is pretty tedious, and it's also easy to miss
>> one of the places when changing the logic, so I think wrapping that in
>> some function makes sense.
>>
>>
>> regards
>>
>> --
>> Tomas Vondra
>
> Hi!
>
> Here is a rebased version of your patch.
> I have tested this patch on a very biased dataset. My test was:
>
> yes 2 | awk '{print $1","$1}'| head -400000000 > data.dat
> create table t (p point);
> copy t from '....data.dat';
>
> and then build GiST index on this.
>
> I did get 3x time speedup for buffered build:
> create index on t using gist(p) with (buffering=on);
>
> But this was still longer than the sorted build. Should we consider
> speeding up sorted builds with parallel sorting?
>

Yes, this was the reason why I did not try to do parallel GiST builds in
PG18. The sorted builds are more efficient than parallel builds - maybe
not always, but I'm not sure how to reliably decide that. Maybe parallel
sorted builds would be faster, not sure.

regards

--
Tomas Vondra

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ranier Vilela 2025-10-24 13:03:48 Re: Avoid resource leak (src/test/regress/pg_regress.c)
Previous Message Yura Sokolov 2025-10-24 12:43:34 Re: use SIMD in GetPrivateRefCountEntry()