Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)

From: Bryan Green <dbryan(dot)green(at)gmail(dot)com>
To: Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
Date: 2026-03-12 17:48:38
Message-ID: CAF+pBj-pAGnTh2un8RGcDqSYuMnwGhXv5_MteB77FNjf-Af=tg@mail.gmail.com
Lists: pgsql-hackers

I don't think your version 1 memcpy is doing what you think it is doing.
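
As a rough check on the numbers quoted below: 1519 nanoseconds for 10000000 loops is about 0.00015 ns per loop, far less than one CPU cycle, which suggests the copy may not be executing at all. memcpy1.c is not reproduced in this message, so the following is only an assumed shape for such a test, not the attached file. It is a minimal sketch in which the copied data is consumed after every iteration so that the compiler cannot discard the memcpy at -O2. DemoKey, copy_per_key, copy_bulk and run_copies are hypothetical names, and the inline-asm barrier is GCC/Clang-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a scan key; not PostgreSQL's ScanKeyData. */
typedef struct DemoKey
{
    int   flags;
    short attno;
    short strategy;
    long  argument;
} DemoKey;

#define DEMO_MAX_KEYS 8

/* One memcpy per key; the size is a compile-time constant. */
static void
copy_per_key(DemoKey *dst, const DemoKey *src, int n)
{
    for (int i = 0; i < n; i++)
        memcpy(&dst[i], &src[i], sizeof(DemoKey));
}

/* One memcpy for all keys; the size depends on n at run time. */
static void
copy_bulk(DemoKey *dst, const DemoKey *src, int n)
{
    memcpy(dst, src, (size_t) n * sizeof(DemoKey));
}

/*
 * Run `loops` copies of n keys and return a checksum of the copied data.
 * The input changes on every iteration and the result is read back, so
 * the copy has an observable effect and cannot be dropped as dead code.
 */
static long
run_copies(void (*copy)(DemoKey *, const DemoKey *, int), int n, long loops)
{
    DemoKey src[DEMO_MAX_KEYS];
    DemoKey dst[DEMO_MAX_KEYS];
    long    sum = 0;

    assert(n >= 1 && n <= DEMO_MAX_KEYS);
    memset(src, 0, sizeof(src));
    memset(dst, 0, sizeof(dst));

    for (long l = 0; l < loops; l++)
    {
        src[0].argument = l;    /* make each iteration's input differ */
        copy(dst, src, n);
        /* Compiler barrier (GCC/Clang extension): treat dst as read. */
        __asm__ volatile("" : : "r"(dst) : "memory");
        sum += dst[0].argument;
    }
    return sum;
}
```

Timing run_copies(copy_per_key, 4, loops) against run_copies(copy_bulk, 4, loops), and printing the returned checksum, should give per-loop times that are at least a plausible number of cycles. If either variant still reports a near-zero total, the benchmark rather than memcpy is the likely culprit.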

On Thu, Mar 12, 2026 at 12:35 PM Ranier Vilela <ranier(dot)vf(at)gmail(dot)com> wrote:

> Hi.
>
> On Mon, Mar 9, 2026 at 14:02, Bryan Green <dbryan(dot)green(at)gmail(dot)com>
> wrote:
>
>> I performed a micro-benchmark on my dual EPYC (Zen 2) server and version
>> 1 wins for small values of n.
>>
>> 20 runs:
>>
>> n    version   min    median  mean   max    stddev  noise%
>> ------------------------------------------------------------
>> n=1  version1  2.440  2.440   2.450  2.550  0.024   4.5%
>> n=1  version2  4.260  4.280   4.277  4.290  0.007   0.7%
>>
>> n=2  version1  2.740  2.750   2.757  2.880  0.029   5.1%
>> n=2  version2  3.970  3.980   3.980  4.020  0.010   1.3%
>>
>> n=4  version1  4.580  4.595   4.649  4.910  0.094   7.2%
>> n=4  version2  5.780  5.815   5.809  5.820  0.013   0.7%
>>
>> But micro-benchmarks always make me nervous, so I looked at the actual
>> instruction cost for my platform given the version 1 and version 2 code.
>>
>> If we count CPU cycles using the AMD Zen 2 instruction latency/throughput
>> tables: version 1 (loop body) has a critical path of ~5-6 cycles per
>> iteration. version 2 (loop body) has ~3-4 cycles per iteration.
>>
>> The problem for version 2 is that the call to memcpy is ~24-30 cycles due
>> to the stub + function call + return
>> and branch predictor pressure on first call. This probably results in
>> ~2.5 ns per iteration cost for version 2.
>>
>> So, no, I wouldn't call it an optimization. But it will be interesting
>> to hear other opinions on this.
>>
> I ran quick and dirty tests with the two versions:
> gcc 15.2.0
> gcc -O2 memcpy1.c -o memcpy1
>
> The first test was with 10000000 keys and 10000000 loops:
> version1: one memcpy call
> done in 1873 nanoseconds
>
> version2: inlined memcpy
> did not finish
>
> The second test was with 4 keys and 10000000 loops:
> version1: one memcpy call
> version2: inlined memcpy call
>
> version1: done in 1519 nanoseconds
> version2: done in 104981851 nanoseconds
> (1.44692e-05 times faster)
>
> version1: done in 1979 nanoseconds
> version2: done in 110568901 nanoseconds
> (1.78983e-05 times faster)
>
> version1: done in 1814 nanoseconds
> version2: done in 108555484 nanoseconds
> (1.67103e-05 times faster)
>
> version1: done in 1631 nanoseconds
> version2: done in 109867919 nanoseconds
> (1.48451e-05 times faster)
>
> version1: done in 1269 nanoseconds
> version2: done in 111639106 nanoseconds
> (1.1367e-05 times faster)
>
> Unless I'm doing something wrong, the single memcpy call wins!
> memcpy1.c attached.
>
> best regards,
> Ranier Vilela
>
