strange perf regression with data checksums

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: strange perf regression with data checksums
Date: 2025-05-09 12:04:49
Message-ID: 941f0190-e3c6-4622-9ac7-c04e936e5fdb@vondra.me
Lists: pgsql-hackers

Hi,

While running some benchmarks comparing 17 and 18, I ran into a simple
workload where 18 throughput drops by ~80%. After pulling my hair out for
a couple of hours I realized the change that triggered this is
04bec894a04c, which enables data checksums by default. That is very
bizarre, because the workload is read-only and fits into shared buffers.

I've only observed this on large machines with 96+ cores on Azure, both
with Intel (Xeon 8370C) and AMD (EPYC 9V33X). I've not been successful
in reproducing it on the puny machines I have at home.

Let me demonstrate the issue.

1) Create a cluster, with an increased connection limit:

pg_ctl -D data init
echo 'max_connections = 1000' >> data/postgresql.conf
pg_ctl -D data -l pg.log start

2) Now the benchmark itself - it's fairly trivial: regular pgbench on
scale 1, with an extra index on the "pgbench_accounts.bid" column:

createdb test
pgbench -i -s 1 test
psql test -c "create index on pgbench_accounts (bid)"

and a script, select.sql, with a simple query using the index:

select count(*) from pgbench_accounts where bid = 0
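For completeness, creating the script file can be as simple as:

echo "select count(*) from pgbench_accounts where bid = 0;" > select.sql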

Cool, now let's get some numbers for 32-160 clients:

for c in 32 64 96 128 160; do
  pgbench -n -f select.sql -T 10 -M prepared -c $c -j $c test | grep 'tps'
done

Which produces this:

tps = 752464.727387 (without initial connection time)
tps = 1062150.015297 (without initial connection time)
tps = 572205.386159 (without initial connection time)
tps = 568579.663980 (without initial connection time)
tps = 561360.493639 (without initial connection time)

Clearly, at 96 clients the throughput just tanks. Now let's disable
checksums on the cluster:

pg_ctl -D data -l pg.log stop
pg_checksums --disable data
pg_ctl -D data -l pg.log start
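To double-check the state after the restart, "show data_checksums" should
now report "off":

psql test -c 'show data_checksums'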

Then run the benchmark loop again:

tps = 753484.468817 (without initial connection time)
tps = 1083752.631856 (without initial connection time)
tps = 1862008.466802 (without initial connection time)
tps = 1826484.489433 (without initial connection time)
tps = 1818400.279069 (without initial connection time)

This time the throughput does not drop - at 96+ clients it's ~3.5x higher
than with checksums enabled. This is from the Xeon machine, but I've seen
the same thing on the EPYC boxes. The impact varies, but in general it's
a 70-80% regression.

I'm not suggesting this is caused by 04bec894a04c, or that it's even
specific to PG 18. I see the same issue on 17, except that 17 does not
enable checksums by default. For example, on EPYC 9V74, PG 17 without
checksums does this:

clients     tps
32          762187.724859
64          1284731.279766
96          2978828.264373
128         2991561.835178
160         2971689.161136

and with checksums enabled:

clients     tps
32          874266.549870
64          1286382.426281
96          569647.384735
128         562128.010301
160         561826.908181

So, similar regression ...

I find this quite bizarre / puzzling, because this is a read-only
workload, with absolutely no writes, and a tiny data set (~15MB), i.e.
everything fits into shared buffers. Why would that be affected by
checksums at all?
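For the record, it's easy to sanity-check the sizes - the whole data set
is a small fraction of even the default shared_buffers:

psql test -c "select pg_size_pretty(pg_total_relation_size('pgbench_accounts'))"
psql test -c "show shared_buffers"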

I spent some time profiling this, without much success. This is what I
get from perf top:

Samples: 6M of event 'cycles:P', 4000 Hz, Event count (approx.):
5270683795302 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
50.94% postgres [.] pg_atomic_read_u32_impl
17.32% postgres [.] pg_atomic_compare_exchange_u32_impl
10.17% postgres [.] spin_delay
5.83% postgres [.] pg_atomic_fetch_or_u32_impl
1.64% postgres [.] pg_atomic_compare_exchange_u32_impl
1.20% postgres [.] BufferDescriptorGetBuffer
0.92% postgres [.] perform_spin_delay

and the report with backtraces says most of the time is spent here:

--97.00%--btgettuple
          |
          --96.98%--_bt_first
                    |
                    |--48.82%--_bt_readfirstpage
                    |          |
                    |          --44.57%--_bt_steppage
                    |                    |
                    |                    --44.45%--ReleaseBuffer
                    |                              |
                    |                              --44.43%--UnpinBuffer
                    |                                        ...
                    |
                    --48.11%--_bt_search
                              |
                              --47.89%--_bt_relandgetbuf
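In case anyone wants to reproduce the profile, something along these
lines should do (flags from memory, so they may need tweaking):

perf top -g                              # live view, the symbol list above
perf record -F 4000 -g -a -- sleep 10    # sample everything for 10 seconds
perf report --stdio                      # the backtrace report above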

The atomic ops come from pinning/unpinning buffers. I realize this can
get much more expensive under concurrency - the clients have to compete
when updating the same counter in the buffer header, and with enough
clients there'll be more conflicts and retries. Kinda unfortunate, and
maybe we should do something about it, not sure.

But why would it depend on checksums at all? This read-only test should
be entirely in-memory, so how come it's affected?

regards

--
Tomas Vondra
