Re: Speeding up GIST index creation for tsvectors

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Speeding up GIST index creation for tsvectors
Date: 2021-08-02 03:40:31
Message-ID: CAJ3gD9ftbJ2Hjf2NJVO83J_8-soVGy2d=JgR91peUYDRfTFknQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, 20 Mar 2021 at 02:19, John Naylor <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> On Fri, Mar 19, 2021 at 8:57 AM Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
> > Regarding the alignment changes... I have removed the code that
> > handled the leading identically unaligned bytes, for lack of evidence
> > that percentage of such cases is significant. Like I noted earlier,
> > for the tsearch data I used, identically unaligned cases were only 6%.
> > If I find scenarios where these cases can be significant after all and
> > if we cannot do anything in the gist index code, then we might have to
> > bring back the unaligned byte handling. I didn't get a chance to dig
> > deeper into the gist index implementation to see why they are not
> > always 8-byte aligned.
>
> I find it stranger that something equivalent to char* is not randomly misaligned, but rather only seems to land on 4-byte boundaries.
>
> [thinks] I'm guessing it's because of VARHDRSZ, but I'm not positive.
>
> FWIW, I anticipate some push back from the community because of the fact that the optimization relies on statistical phenomena.

I dug into this issue for the tsvector type. It turns out that the way
the sign array elements are laid out is what causes the pointers to
be misaligned:

Datum
gtsvector_picksplit(PG_FUNCTION_ARGS)
{
    ......
    cache = (CACHESIGN *) palloc(sizeof(CACHESIGN) * (maxoff + 2));
    cache_sign = palloc(siglen * (maxoff + 2));

    for (j = 0; j < maxoff + 2; j++)
        cache[j].sign = &cache_sign[siglen * j];
    ....
}

If siglen is not a multiple of 8 (say 700), cache[j].sign will in some
cases point to non-8-byte-aligned addresses, as you can see in the
above code snippet.
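
To make the arithmetic concrete, here is a minimal standalone sketch (not
PostgreSQL code) of where back-to-back slots land when the stride is
siglen. It assumes the base pointer itself is 8-byte aligned, which is what
I would expect from palloc() on a 64-bit build; the helper names
slot_misalignment and count_misaligned_slots are made up for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical helper: distance (in bytes) of slot j past the previous
 * 8-byte boundary, when slots of 'stride' bytes are packed back to back.
 * Assumption: the start of the buffer is itself 8-byte aligned.
 */
static size_t
slot_misalignment(size_t stride, int j)
{
    return (stride * (size_t) j) % 8;
}

/* Hypothetical helper: how many of the first nslots slots are misaligned. */
static int
count_misaligned_slots(size_t stride, int nslots)
{
    int count = 0;

    for (int j = 0; j < nslots; j++)
    {
        if (slot_misalignment(stride, j) != 0)
            count++;
    }
    return count;
}
```

With siglen = 700 we have 700 mod 8 = 4, so every odd j starts 4 bytes past
an 8-byte boundary and every even j is aligned. That would also be
consistent with the pointers only ever landing on 4-byte boundaries.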

Replacing siglen with MAXALIGN64(siglen) in the above snippet gets rid
of the misalignment. This change, applied on top of the 0001-v3 patch,
gives an additional ~15% benefit. MAXALIGN64(siglen) will use a bit more
space, but for not-so-small siglens this looks worth doing. I haven't
yet looked into types other than tsvector.
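
A rough sketch of the effect of rounding the stride up, again as standalone
C rather than the actual patch. align_up8 is a local stand-in that mimics
what MAXALIGN64 does when the maximum alignment is 8 bytes (an assumption
about the platform); padded_slots_aligned and padding_overhead are
hypothetical helpers, and the buffer start is assumed to be 8-byte aligned.

```c
#include <stddef.h>

/*
 * Stand-in for MAXALIGN64 under the assumption of 8-byte maximum
 * alignment: round len up to the next multiple of 8.
 */
static size_t
align_up8(size_t len)
{
    return (len + 7) & ~(size_t) 7;
}

/*
 * Hypothetical check: returns 1 if every one of nslots slots, packed with
 * the given stride from an 8-byte-aligned base, starts on an 8-byte
 * boundary, and 0 otherwise.
 */
static int
padded_slots_aligned(size_t stride, int nslots)
{
    for (int j = 0; j < nslots; j++)
    {
        if ((stride * (size_t) j) % 8 != 0)
            return 0;
    }
    return 1;
}

/* Extra bytes spent on padding for nslots signatures of siglen bytes. */
static size_t
padding_overhead(size_t siglen, int nslots)
{
    return (align_up8(siglen) - siglen) * (size_t) nslots;
}
```

For siglen = 700 the stride becomes 704, so the cost is 4 bytes per
signature, and at most 7 bytes per signature for any siglen.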

I will get back to you on your other review comments. Meanwhile, I
thought I would post the above update first.
