Re: WIP: parallel GiST index builds

From: Kirill Reshke <reshkekirill(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Michael Paquier <michael(at)paquier(dot)xyz>, Andreas Karlsson <andreas(at)proxel(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WIP: parallel GiST index builds
Date: 2025-10-23 14:13:03
Message-ID: CALdSSPi5HLxOdopuOtCA8ENGfnZDjrpu4t7VRGpD+MJA=KnxCQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 8 Nov 2024 at 19:53, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> On 10/31/24 19:05, Andrey M. Borodin wrote:
> >
> >
> >> On 8 Oct 2024, at 17:05, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> >>
> >> Here's an updated patch adding the queryid.
> >
> > I've took another round of looking through the patch.
> > Everything I see seems correct to me. It just occurred to me that we
> > will have: buffered build, parallel build, sorting build. All 3 aiming
> > to speed things up. I really hope that we will find a way to parallel
> > sorting build, because it will be fastest for sure.
> >
>
> The number of different ways to build a GiST index worries me. When we
> had just buffered vs. sorted builds, that was pretty easy decision,
> because buffered builds are much faster 99% of the time.
>
> But for parallel builds it's not that clear - it can easily happen that
> we use much more CPU, without speeding anything up. We just start as
> many parallel workers as possible, given the maintenance_work_mem value,
> and hope for the best.
>
> Maybe with parallel buffered builds it's be clearer ... but I'm not
> sufficiently familiar with that code, and I don't have time to study
> that at the moment because of other patches. Someone else would have to
> take a stab at that ...
>
> >
> > Currently we have some instances of such code...or similar... or
> > related code.
> >
> > if (is_build)
> > {
> > if (is_parallel)
> > recptr = GetFakeLSNForUnloggedRel();
> > else
> > recptr = GistBuildLSN;
> > }
> > else
> > {
> > if (RelationNeedsWAL(rel))
> > {
> > recptr = actuallyWriteWAL();
> > }
> > else
> > recptr = gistGetFakeLSN(rel);
> > }
> > // Use recptr
> >
> > In previous review I was proponent of not adding arguments to
> > gistGetFakeLSN(). But now I see that now logic of choosing LSN
> > source is sprawling across various places...
> >
> > Now I do not have a strong point of view on this. Do you think if
> > something like following would be clearer?
> > if (!is_build && RelationNeedsWAL(rel))
> > {
> > recptr = actuallyWriteWAL();
> > }
> > else
> > recptr = gistGetFakeLSN(rel, is_build, is_parallel);
> >
> > Just as an idea.
> >
> > I'm mostly looking on GiST-specific parts of the patch, while things
> > around entering parallel mode seems much more complicated. But as far as
> > I can see, large portions of this code are taken from B-tree\BRIN.
> >
>
> I agree the repeated code is pretty tedious, and it's also easy to miss
> one of the places when changing the logic, so I think wrapping that in
> some function makes sense.
>
>
> regards
>
> --
> Tomas Vondra

Hi!

Here is a rebased version of your patch.
I have tested this patch on a very biased dataset. My test was:

yes 2 | awk '{print $1","$1}'| head -400000000 > data.dat
create table t (p point);
copy t from '....data.dat';

and then build GiST index on this.

I did get 3x time speedup for buffered build:
create index on t using gist(p) with (buffering=on);

But this was still longer than the sorted build. Should we consider
speeding up sorted builds with parallel sorting?

I also checked GiST indexes with amcheck from [0]. It did not complain.

[0] https://www.postgresql.org/message-id/CALdSSPhs4sCd5S_euKxTufk8sOA739ydBwhoGFYQenya7YZyiA%40mail.gmail.com

--
Best regards,
Kirill Reshke

Attachment Content-Type Size
v20251023-0001-WIP-parallel-GiST-build.patch application/octet-stream 33.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2025-10-23 14:27:11 Re: CI: Add task that runs pgindent
Previous Message Jelte Fennema-Nio 2025-10-23 14:10:04 Re: Add uuid_to_base32hex() and base32hex_to_uuid() built-in functions