Quick Links

Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)

From:	Bryan Green <dbryan(dot)green(at)gmail(dot)com>
To:	Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
Cc:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
Date:	2026-03-09 17:06:28
Message-ID:	CAF+pBj9M8rCoS-EEBwLiA6hdxm_UNjxMXG9Vc4RU8xifqnQB-g@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

I meant to add that the micro-benchmark went through each loop 100,000,000
iterations for each run...
for 20 runs to come up with those numbers.

--bg

On Mon, Mar 9, 2026 at 11:02 AM Bryan Green <dbryan(dot)green(at)gmail(dot)com> wrote:

> I performed a micro-benchmark on my dual epyc (zen 2) server and version 1
> wins for small values of n.
>
> 20 runs:
>
> n version min median mean max stddev noise%
> -----------------------------------------------------------------------
> n=1 version1 2.440 2.440 2.450 2.550 0.024 4.5%
> n=1 version2 4.260 4.280 4.277 4.290 0.007 0.7%
>
> n=2 version1 2.740 2.750 2.757 2.880 0.029 5.1%
> n=2 version2 3.970 3.980 3.980 4.020 0.010 1.3%
>
> n=4 version1 4.580 4.595 4.649 4.910 0.094 7.2%
> n=4 version2 5.780 5.815 5.809 5.820 0.013 0.7%
>
> But, micro-benchmarks always make me nervous, so I looked at the actual
> instruction cost for my
> platform given the version 1 and version 2 code.
>
> If we count cpu cycles using the AMD Zen 2 instruction latency/throughput
> tables: version 1 (loop body)
> has a critical path of ~5-6 cycles per iteration. version 2 (loop body)
> has ~3-4 cycles per iteration.
>
> The problem for version 2 is that the call to memcpy is ~24-30 cycles due
> to the stub + function call + return
> and branch predictor pressure on first call. This probably results in
> ~2.5 ns per iteration cost for version 2.
>
> So, no I wouldn't call it an optimization. But, it will be interesting to
> hear other opinions on this.
>
> --bg
>
>
> On Mon, Mar 9, 2026 at 10:25 AM Ranier Vilela <ranier(dot)vf(at)gmail(dot)com> wrote:
>
>>
>>
>> Em seg., 9 de mar. de 2026 às 11:47, Bryan Green <dbryan(dot)green(at)gmail(dot)com>
>> escreveu:
>>
>>> I created an example that is a little bit closer to the actual code and
>>> changed the compiler from C++ to C.
>>>
>>> It is interesting the optimization that the compiler has chosen for
>>> version 1 versus version 2. One calls
>>> memcpy and one doesn't. There is a good chance the inlining of memcpy
>>> as SSE+scalar per iteration
>>> will be faster for syscache scans-- which I believe are usually small
>>> (1-4 keys?).
>>>
>> I doubt the inline version is better.
>> Clang is supported too and the code generated is much better with memcpy
>> one call outside of the loop.
>>
>>
>>>
>>> Probably the only reason to do this patch would be if N is normally
>>> large or if this is considered an
>>> improvement in code clarity without a detrimental impact on small N
>>> syscache scans.
>>> I realize you only said "possible small optimization". It might be
>>> worthwhile to benchmark the code for
>>> different values of n to determine if there is a tipping point either
>>> way?
>>>
>> In your opinion, shouldn't this be considered an optimization, even a
>> small one?
>>
>> best regards,
>> Ranier Vilela
>>
>

In response to

Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) at 2026-03-09 17:02:05 from Bryan Green

Responses

Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) at 2026-03-20 09:54:05 from lakshmi

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andrey Borodin	2026-03-09 17:07:59	Re: Compression of bigger WAL records
Previous Message	Xuneng Zhou	2026-03-09 17:05:26	Re: Refactor recovery conflict signaling a little