From: "Zhou, Zhiguo" <zhiguo.zhou@intel.com>
To: "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>, "Andres Freund" <andres@anarazel.de>, Yura Sokolov <y.sokolov@postgrespro.ru>, "Tom Lane" <tgl@sss.pgh.pa.us>, Heikki Linnakangas <hlinnaka@iki.fi>, "Shankaran, Akash" <akash.shankaran@intel.com>, "Kim, Andrew" <andrew.kim@intel.com>
Cc: <tianyou.li@intel.com>
Subject: [RFC] Enhance scalability of TPCC performance on HCC (high-core-count) systems
Date: 2025-07-08 19:11:46
Message-ID: e241f2c1-e2e2-41b3-a9d9-dbe9589643e0@intel.com

Dear PostgreSQL Community,

Over recent months, we've submitted several patches ([1][2][3][4])
targeting performance bottlenecks in HammerDB/TPROC-C scalability on
high-core-count (HCC) systems. Since these optimizations form a
dependent chain (each later patch builds upon the earlier ones), we'd
like to present a holistic overview of our findings and proposals to
accelerate review and gather community feedback.

---
### Why HCC and TPROC-C Matter
Modern servers now routinely deploy hundreds of cores (approaching
1,000+), introducing hardware challenges like NUMA latency and
cache-coherency overhead. For Cloud Service Providers (CSPs) offering
managed Postgres, scalable HCC performance is critical to maximizing
hardware ROI. HammerDB/TPROC-C, a practical, industry-standard OLTP
benchmark, exposes these scalability roadblocks under high concurrency,
making it essential for real-world performance validation.

---
### The Problem: Scalability Collapse
Our analysis on a 384-vCPU Intel system revealed severe scalability
collapse: HammerDB's NOPM metric regressed as core counts increased
(Fig 1). We identified three chained bottlenecks:

1. Limited WALInsertLocks parallelism, starving CPU utilization
(only 17.4% observed).
2. Acute contention on insertpos_lck when #1 was mitigated.
3. LWLock shared-acquisition overhead becoming dominant after #1 and
#2 were resolved.

---
### Proposed Optimization Steps
Our three-step approach tackles these dependencies systematically:

Step 1: Unlock Parallel WAL Insertion

Patch [1]: Increase NUM_XLOGINSERT_LOCKS to allow more concurrent XLog
inserters. The bcc/offcputime flamegraph in Fig 2 shows that the low
CPU utilization is caused by the small NUM_XLOGINSERT_LOCKS limiting
the number of concurrent XLog inserters.
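
For reference, the knob is a compile-time constant in xlog.c. A minimal
sketch of the change (the value below is illustrative only; the exact
value chosen is discussed in [1]):

```c
/* src/backend/access/transam/xlog.c (sketch; value illustrative) */
/* Upstream defines 8 WAL insertion locks; raising the count lets more
 * backends copy their records into the WAL buffers concurrently. */
#define NUM_XLOGINSERT_LOCKS	64	/* was 8; see [1] for the chosen value */
```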

Patch [2]: Replace the insertpos_lck spinlock with lock-free XLog
reservation via atomic operations. This reduces the critical section
to a single pg_atomic_fetch_add_u64(), cutting severe lock contention
when reserving WAL space. (Kudos to Yura Sokolov for enhancing
robustness with a Murmur-hash table!)
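
A minimal sketch of the reservation fast path, with simplified names
(the actual patch additionally tracks each record's prev-link, which
is where the Murmur-hash table comes in):

```c
#include "postgres.h"
#include "port/atomics.h"

/*
 * Sketch of lock-free WAL space reservation. Upstream advances
 * CurrBytePos/PrevBytePos under the insertpos_lck spinlock; here the
 * hot path collapses to one atomic fetch-add, so concurrent backends
 * each receive a disjoint [start, start + size) byte range.
 */
typedef struct InsertPosSketch
{
	pg_atomic_uint64 CurrBytePos;	/* next free byte position in WAL */
} InsertPosSketch;

static InsertPosSketch *Insert;		/* lives in shared memory in reality */

static inline uint64
ReserveXLogBytes(uint64 size)
{
	return pg_atomic_fetch_add_u64(&Insert->CurrBytePos, size);
}
```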

Result: [1]+[2] yields a 1.25x NOPM gain.
(Note: the ~1.8x improvement reported in [1] was measured on a
different device with 480 vCPUs, hence the difference from the
numbers here.)

Steps 2 & 3: Optimize LWLock Scalability

Patch [3]: Merge LWLock shared-state updates into a single atomic add
(replacing read-modify-write loops). This reduces cache-coherence
overhead under contention.
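
The shape of the change, as a minimal sketch (the LW_VAL_* constants
mirror the private definitions in lwlock.c; the function name and
slow-path handling are simplified, not the patch's actual code):

```c
#include "postgres.h"
#include "storage/lwlock.h"

/* Mirrors of the private definitions in lwlock.c. */
#define LW_VAL_EXCLUSIVE	((uint32) 1 << 24)
#define LW_VAL_SHARED		1

/* Shared acquisition as one unconditional atomic add, replacing the
 * upstream read-modify-write (compare-and-swap) retry loop. */
static bool
LWLockAttemptLockSharedSketch(LWLock *lock)
{
	uint32		old_state;

	/* Optimistically take a shared reference: one cache-line transaction. */
	old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);

	if (likely((old_state & LW_VAL_EXCLUSIVE) == 0))
		return true;			/* no writer held the lock */

	/* A writer holds it: roll back our reference and let the caller
	 * fall through to the usual wait-queue slow path. */
	pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
	return false;
}
```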

Result: [1]+[2]+[3] yields a 1.52x NOPM gain.

Patch [4]: Introduce ReadBiasedLWLock for heavily shared locks
(e.g., ProcArrayLock). It partitions the reader lock state across 16
cache lines, mitigating the readers' atomic contention.
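
A rough sketch of the reader-side layout (RB_NSLOTS, RBLockTryShared,
and the writer flag below are illustrative names of ours, and the
writer's drain protocol is omitted; see [4] for the real design):

```c
#include "postgres.h"
#include "port/atomics.h"
#include "storage/procnumber.h"	/* for MyProcNumber (PG17+) */

#define RB_NSLOTS			16
#define RB_CACHE_LINE_SIZE	64

/* Readers are spread over 16 cache-line-padded slots so that shared
 * acquirers bump different cache lines instead of one hot atomic; an
 * exclusive acquirer must sum/drain every slot (slow path not shown). */
typedef struct ReadBiasedLWLockSketch
{
	struct
	{
		pg_atomic_uint32 readers;
		char		pad[RB_CACHE_LINE_SIZE - sizeof(pg_atomic_uint32)];
	}			slots[RB_NSLOTS];
	pg_atomic_uint32 writer;	/* nonzero while a writer holds or waits */
} ReadBiasedLWLockSketch;

static inline bool
RBLockTryShared(ReadBiasedLWLockSketch *lock)
{
	int			s = MyProcNumber % RB_NSLOTS;	/* per-backend slot */

	pg_atomic_fetch_add_u32(&lock->slots[s].readers, 1);
	if (pg_atomic_read_u32(&lock->writer) == 0)
		return true;			/* fast path: no writer present */

	/* Writer active: undo our count and take the slow path (not shown). */
	pg_atomic_fetch_sub_u32(&lock->slots[s].readers, 1);
	return false;
}
```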

Result: [1]+[2]+[3]+[4] yields a 2.10x NOPM improvement.

---
### Overall Impact
With all patches applied, we observe:
- 2.06x NOPM improvement vs. upstream (384 vCPUs; HammerDB: 192 VUs,
757 warehouses).
- Accumulated gains from each optimization step (Fig 3).
- Enhanced performance scalability with core count (Fig 4).

---
### Figures & Patch Links
Fig 1: TPROC-C scalability regression (1 socket view)
Fig 2: offcputime flamegraph (pre-optimization)
Fig 3: Accumulated gains (full cores)
Fig 4: Accumulated gains vs core count (1 socket view)

[1] Increase NUM_XLOGINSERT_LOCKS:
https://www.postgresql.org/message-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru
[2] Lock-free XLog Reservation from WAL:
https://www.postgresql.org/message-id/flat/PH7PR11MB5796659F654F9BE983F3AD97EF142%40PH7PR11MB5796.namprd11.prod.outlook.com
[3] Optimize shared LWLock acquisition for high-core-count systems:
https://www.postgresql.org/message-id/flat/73d53acf-4f66-41df-b438-5c2e6115d4de%40intel.com
[4] Optimize LWLock scalability via ReadBiasedLWLock for heavily-shared
locks:
https://www.postgresql.org/message-id/e7d50174-fbf8-4a82-a4cd-1c4018595d1b@intel.com

Best regards,
Zhiguo

Attachment Content-Type Size
Fig1.png image/png 26.2 KB
Fig2.png image/png 261.4 KB
Fig3.png image/png 34.2 KB
Fig4.png image/png 77.9 KB
