| From: | Bryan Green <dbryan(dot)green(at)gmail(dot)com> |
|---|---|
| To: | Ranier Vilela <ranier(dot)vf(at)gmail(dot)com> |
| Cc: | Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) |
| Date: | 2026-03-09 17:02:05 |
| Message-ID: | CAF+pBj-K2bgNQRc9ih01WFmAWUaQtVbS37jLtYdYh5LOwOkF6A@mail.gmail.com |
| Lists: | pgsql-hackers |
I performed a micro-benchmark on my dual-EPYC (Zen 2) server, and version 1
wins for small values of n.
20 runs:
n   version    min     median  mean    max     stddev  noise%
--------------------------------------------------------------
1   version1   2.440   2.440   2.450   2.550   0.024   4.5%
1   version2   4.260   4.280   4.277   4.290   0.007   0.7%
2   version1   2.740   2.750   2.757   2.880   0.029   5.1%
2   version2   3.970   3.980   3.980   4.020   0.010   1.3%
4   version1   4.580   4.595   4.649   4.910   0.094   7.2%
4   version2   5.780   5.815   5.809   5.820   0.013   0.7%
But micro-benchmarks always make me nervous, so I also looked at the actual
instruction cost on my platform for the version 1 and version 2 code.

Counting CPU cycles from the AMD Zen 2 instruction latency/throughput tables,
the version 1 loop body has a critical path of ~5-6 cycles per iteration,
while the version 2 loop body has ~3-4 cycles per iteration. The problem for
version 2 is that the call to memcpy costs ~24-30 cycles, due to the stub,
the function call and return, and branch-predictor pressure on the first
call. This probably adds ~2.5 ns per iteration for version 2.

So no, I wouldn't call it an optimization. But it will be interesting to
hear other opinions on this.
--bg
On Mon, Mar 9, 2026 at 10:25 AM Ranier Vilela <ranier(dot)vf(at)gmail(dot)com> wrote:
>
>
> On Mon, Mar 9, 2026 at 11:47 AM, Bryan Green <dbryan(dot)green(at)gmail(dot)com>
> wrote:
>
>> I created an example that is a little bit closer to the actual code and
>> changed the compiler from C++ to C.
>>
>> It is interesting which optimization the compiler chose for version 1
>> versus version 2: one calls memcpy and one doesn't. There is a good chance
>> that inlining memcpy as SSE+scalar per iteration will be faster for
>> syscache scans, which I believe usually involve a small number of keys
>> (1-4?).
>>
> I doubt the inline version is better.
> Clang is supported too, and the code it generates is much better, with one
> memcpy call outside of the loop.
>
>
>>
>> Probably the only reason to do this patch would be if N is normally large,
>> or if it is considered an improvement in code clarity without a
>> detrimental impact on small-N syscache scans.
>> I realize you only said "possible small optimization". It might be
>> worthwhile to benchmark the code for different values of n to determine
>> whether there is a tipping point either way.
>>
> In your opinion, shouldn't this be considered an optimization, even a
> small one?
>
> best regards,
> Ranier Vilela
>