From: | Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> |
---|---|
To: | John Naylor <john(dot)naylor(at)enterprisedb(dot)com> |
Cc: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Speeding up GIST index creation for tsvectors |
Date: | 2021-08-02 03:40:31 |
Message-ID: | CAJ3gD9ftbJ2Hjf2NJVO83J_8-soVGy2d=JgR91peUYDRfTFknQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Sat, 20 Mar 2021 at 02:19, John Naylor <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> On Fri, Mar 19, 2021 at 8:57 AM Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
> > Regarding the alignment changes... I have removed the code that
> > handled the leading identically unaligned bytes, for lack of evidence
> > that percentage of such cases is significant. Like I noted earlier,
> > for the tsearch data I used, identically unaligned cases were only 6%.
> > If I find scenarios where these cases can be significant after all and
> > if we cannot do anything in the gist index code, then we might have to
> > bring back the unaligned byte handling. I didn't get a chance to dig
> > deeper into the gist index implementation to see why they are not
> > always 8-byte aligned.
>
> I find it stranger that something equivalent to char* is not randomly misaligned, but rather only seems to land on 4-byte boundaries.
>
> [thinks] I'm guessing it's because of VARHDRSZ, but I'm not positive.
>
> FWIW, I anticipate some push back from the community because of the fact that the optimization relies on statistical phenomena.
I dug into this issue for the tsvector type. It turns out that the way
the sign array elements are laid out is what causes the pointers to
be misaligned:
Datum
gtsvector_picksplit(PG_FUNCTION_ARGS)
{
    ......
    cache = (CACHESIGN *) palloc(sizeof(CACHESIGN) * (maxoff + 2));
    cache_sign = palloc(siglen * (maxoff + 2));
    for (j = 0; j < maxoff + 2; j++)
        cache[j].sign = &cache_sign[siglen * j];
    ....
}
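If siglen is not a multiple of 8 (say 700), cache[j].sign will in some
cases point to non-8-byte-aligned addresses, as you can see in the
above code snippet. With siglen = 700, for instance, 700 % 8 = 4, so
every odd j lands 4 bytes past an 8-byte boundary, which matches the
4-byte alignment noted earlier.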
Replacing siglen with MAXALIGN64(siglen) in the above snippet gets rid
of the misalignment. This change, applied on top of the 0001-v3 patch,
gives an additional ~15% benefit. MAXALIGN64(siglen) will use a bit more
space, but for not-so-small siglens this looks worth doing. I haven't
yet checked types other than tsvector.
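To illustrate the arithmetic, here is a minimal standalone sketch (not
PostgreSQL code). It assumes the buffer itself starts on an 8-byte
boundary and that the maximum alignment is 8 bytes; align_up8() and
count_misaligned_slots() are hypothetical helpers, with align_up8()
standing in for MAXALIGN64 under that assumption.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round len up to the next multiple of 8. Hypothetical stand-in for
 * MAXALIGN64(len), assuming a maximum alignment of 8 bytes.
 */
size_t
align_up8(size_t len)
{
    return (len + 7) & ~(size_t) 7;
}

/*
 * Slots are laid out back to back, "stride" bytes apart, in a buffer
 * that is assumed to start on an 8-byte boundary. Count how many of
 * the first nslots slots start at an offset that is not a multiple
 * of 8, i.e. would be misaligned for 64-bit word access.
 */
int
count_misaligned_slots(size_t stride, int nslots)
{
    int         misaligned = 0;

    assert(nslots >= 0);
    for (int j = 0; j < nslots; j++)
    {
        if ((stride * (size_t) j) % 8 != 0)
            misaligned++;
    }
    return misaligned;
}
```

With a stride of 700, count_misaligned_slots(700, 10) reports 5 (the odd
slots); with a stride of align_up8(700), which is 704, it reports 0, at
a cost of 4 extra bytes per slot.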
I will get back to you on your other review comments. Meanwhile, I
thought I would post the above update first.