| From: | "Okanovic, Haris" <harisokn(at)amazon(dot)com> |
|---|---|
| To: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Replace spin-wait with futex-mutex in LWLockWaitListLock() on Linux aarch64 |
| Date: | 2026-06-05 21:52:58 |
| Message-ID: | DM6PR18MB29081469262A7BBCE85220B3A8112@DM6PR18MB2908.namprd18.prod.outlook.com |
| Lists: | pgsql-hackers |
Hi Hackers,
The attached patch replaces LWLockWaitListLock()'s spin-polling loop
with a futex-mutex on Linux aarch64 to address a scalability bottleneck
on large Arm systems. I'd appreciate the community's perspective on
whether platform-specific logic in LWLock is an acceptable way to
improve Arm performance, and would welcome any alternative suggestions.
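For readers unfamiliar with the technique, the general shape of a
futex-based mutex can be sketched as below. This is not the code from the
attached patch; it is a minimal illustration of the common three-state
design (free / locked / locked with possible waiters), assuming Linux and
C11 atomics. All demo_* names are hypothetical, and the demo uses threads
in one process rather than PostgreSQL backends sharing memory.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define DEMO_MAX_THREADS 64
#define DEMO_ITERS 20000

/* Hypothetical type: 0 = free, 1 = locked, 2 = locked, waiters possible. */
typedef struct
{
    _Atomic uint32_t state;
} demo_futex_mutex;

/* Thin wrapper; glibc does not export a futex() function. */
static long
demo_futex(_Atomic uint32_t *addr, int op, uint32_t val)
{
    return syscall(SYS_futex, (uint32_t *) addr, op, val, NULL, NULL, 0);
}

static void
demo_lock(demo_futex_mutex *m)
{
    uint32_t c = 0;

    /* Fast path: uncontended 0 -> 1 transition, no system call. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;

    /* Slow path: mark the lock contended, then sleep instead of spinning. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0)
    {
        /* Sleeps only if state is still 2; otherwise returns at once. */
        demo_futex(&m->state, FUTEX_WAIT, 2);
        c = atomic_exchange(&m->state, 2);
    }
}

static void
demo_unlock(demo_futex_mutex *m)
{
    /* Wake one sleeper only if someone may be waiting. */
    if (atomic_exchange(&m->state, 0) == 2)
        demo_futex(&m->state, FUTEX_WAKE, 1);
}

static demo_futex_mutex demo_mutex;
static long demo_counter;       /* deliberately non-atomic */

static void *
demo_worker(void *arg)
{
    (void) arg;
    for (int i = 0; i < DEMO_ITERS; i++)
    {
        demo_lock(&demo_mutex);
        demo_counter++;
        demo_unlock(&demo_mutex);
    }
    return NULL;
}

/* Runs nthreads workers and returns the final counter value. */
static long
demo_run(int nthreads)
{
    pthread_t threads[DEMO_MAX_THREADS];

    assert(nthreads > 0 && nthreads <= DEMO_MAX_THREADS);
    atomic_store(&demo_mutex.state, 0);
    demo_counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&threads[i], NULL, demo_worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);
    return demo_counter;
}
```

Calling demo_run(n) should return n * DEMO_ITERS if mutual exclusion
holds. The point of the design is that contended waiters sleep in the
kernel rather than repeatedly hitting the lock's cache line with atomic
operations; the cost is a system call on the contended path.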
Problem:
On large Arm Neoverse V3 machines (192 cores), we observe excessive
time spent in LWLockWaitListLock() under high concurrency. For example,
pgbench TPC-B at 1000+ clients shows more than 40% of CPU time consumed
spinning on atomic operations while acquiring wait-list locks. PMU
profiles indicate that this spinning causes excessive cache-line bouncing.
Benchmarks:
The pgbench results show a 55% improvement on Neoverse V3 at large scale,
and marginal improvements or no change on Neoverse V2, V1, and N1. The
client count at which throughput peaks also doubles on Neoverse V3 and
V2: peak throughput is reached at ~1000 clients instead of ~500.
Intel Granite Rapids and AMD Turin (x86_64) both show minor degradation
with the change, which is the reason the patch is currently limited to
aarch64 only.
Benchmark results can be found in these plots:
https://github.com/harisokanovic/harismisc/tree/master/postgres/pgsqlscaling/2026-06-05/
- 9g/8g/7g/6g are Arm Neoverse V3/V2/V1/N1 systems at different AWS sizes.
- 8i are Intel Granite Rapids systems at different sizes.
- 8a are AMD Turin systems at different sizes.
- m*.48xl are 192 cores, 768 GB DRAM.
- m*.24xl are 96 cores, 384 GB DRAM.
- m*.16xl are 64 cores, 256 GB DRAM.
- m*.4xl are 16 cores, 64 GB DRAM.
Thanks,
Haris Okanovic
AWS Graviton Software
| Attachment | Content-Type | Size |
|---|---|---|
| LWLockWaitListLock-futex-mutex-Linux-aarch64-v1.patch | text/x-patch | 4.0 KB |