Re: Popcount optimization for the slow-path lookups

From: Andrew Pogrebnoi <andrew(dot)pogrebnoi(at)percona(dot)com>
To: David Geier <geidav(dot)pg(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Popcount optimization for the slow-path lookups
Date: 2025-12-05 14:53:38
Message-ID: CA+aWR11v_s_3wjJRvFPHNbZTv-+dJVWaKc_i4gxH-3jYw_TmFw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi David,

Thanks for looking at it!

> I would like to test if I can reproduce your results. Could you share
> your test program?

Here you go:
https://gist.github.com/dAdAbird/1480ff15764f5a6301174806d8512a3a

> You also don't specify an optimization level. That means the default
> level -O0 is used. Does it make it a difference if you run e.g. with -O3?

Yes, good point. I got carried away with experiments and totally forgot
that..

I've re-run tests with -O3 and it seems like the Summing of Adjacent Bits
might make sense for uint64:

X86
```

% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src
-lbenchmark -lpthread -O3 -o mybenchmark && ./mybenchmark

2025-12-05T16:22:43+02:00

Running ./mybenchmark

Run on (12 X 3200 MHz CPU s)

CPU Caches:

L1 Data 32 KiB

L1 Instruction 32 KiB

L2 Unified 256 KiB (x6)

L3 Unified 12288 KiB

Load Average: 2.67, 4.72, 2.90

--------------------------------------------------------------------

Benchmark Time CPU Iterations

--------------------------------------------------------------------

Popcnt64_BranchlessLookup 5.92 ns 5.90 ns 113871130

Popcnt64_AdjacentBits 5.30 ns 5.27 ns 115084258

Popcnt64_PG 8.34 ns 8.32 ns 80622869

Popcnt32_BranchlessLookup 3.62 ns 3.61 ns 197816110

Popcnt32_AdjacentBits 4.56 ns 4.55 ns 154932383

Popcnt32_PG 5.32 ns 5.31 ns 121845083
```

M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include
-L../benchmark/build/src -lbenchmark -lpthread -O3 -o mybenchmark &&
./mybenchmark
2025-12-05T14:29:18+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
L1 Data 128 KiB (x10)
L1 Instruction 192 KiB (x10)
L2 Unified 12288 KiB (x10)
Load Average: 0.01, 0.01, 0.00
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 2.13 ns 2.13 ns 342101937
Popcnt64_AdjacentBits 1.89 ns 1.89 ns 370316571
Popcnt64_PG 3.83 ns 3.83 ns 190990088
Popcnt32_BranchlessLookup 1.19 ns 1.19 ns 591797267
Popcnt32_AdjacentBits 1.73 ns 1.73 ns 409381266
Popcnt32_PG 2.01 ns 2.01 ns 344969625
```

---
Cheers,
Andy

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Euler Taveira 2025-12-05 14:58:32 Re: making tid and HOTness of UPDATE available to logical decoding plugins
Previous Message Bertrand Drouvot 2025-12-05 14:51:45 Re: More const-marking cleanup