From: "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "Andres Freund" <andres(at)anarazel(dot)de>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "Shankaran, Akash" <akash(dot)shankaran(at)intel(dot)com>, "Kim, Andrew" <andrew(dot)kim(at)intel(dot)com>
Cc: <tianyou(dot)li(at)intel(dot)com>
Subject: [RFC] Enhance scalability of TPCC performance on HCC (high-core-count) systems
Date: 2025-07-08 19:11:46
Message-ID: e241f2c1-e2e2-41b3-a9d9-dbe9589643e0@intel.com
Lists: pgsql-hackers
Dear PostgreSQL Community,
Over recent months, we've submitted several patches ([1][2][3][4])
targeting performance bottlenecks in HammerDB/TPROC-C scalability on
high-core-count (HCC) systems. Because these optimizations form a
dependent chain (later patches build upon earlier ones), we’d like to
present a holistic overview of our findings and proposals to accelerate
review and gather community feedback.
---
### Why HCC and TPROC-C Matter
Modern servers now routinely deploy 100s of cores (approaching 1,000+),
introducing hardware challenges like NUMA latency and cache coherency
overheads. For Cloud Service Providers (CSPs) offering managed Postgres,
scalable HCC performance is critical to maximize hardware ROI.
HammerDB/TPROC-C—a practical, industry-standard OLTP benchmark—exposes
critical scalability roadblocks under high concurrency, making it
essential for real-world performance validation.
---
### The Problem: Scalability Collapse
Our analysis on a 384-vCPU Intel system revealed severe scalability
collapse: HammerDB’s NOPM metric regressed as core counts increased
(Fig 1). We identified three chained bottlenecks:
1. Limited WALInsertLocks parallelism, starving CPU utilization
(only 17.4% observed).
2. Acute contention on insertpos_lck when #1 was mitigated.
3. LWLock shared acquisition overhead becoming dominant after #1–#2
were resolved.
---
### Proposed Optimization Steps
Our three-step approach tackles these dependencies systematically:
Step 1: Unlock Parallel WAL Insertion
Patch [1]: Increase NUM_XLOGINSERT_LOCKS to allow more concurrent
XLog inserters. The bcc/offcputime flamegraph in Fig 2 shows that the
low CPU utilization is caused by the small NUM_XLOGINSERT_LOCKS value
restricting the number of concurrent XLog inserters.
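For reference, the core of patch [1] amounts to a one-line change; a
minimal sketch follows, where the value shown is illustrative rather
than the exact value proposed in [1]:

```c
/*
 * Sketch of the change behind patch [1], in
 * src/backend/access/transam/xlog.c.  Upstream defines 8 WAL insertion
 * locks; raising the constant lets more backends copy their records into
 * the WAL buffers concurrently.  The value 128 is illustrative only --
 * see [1] for the value actually proposed and the sizing discussion.
 */
#define NUM_XLOGINSERT_LOCKS    128     /* upstream default: 8 */
```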
Patch [2]: Replace insertpos_lck spinlock with lock-free XLog
reservation via atomic operations. This reduces the critical section
to a single pg_atomic_fetch_add_u64(), cutting severe lock contention
when reserving WAL space. (Kudos to Yura Sokolov for enhancing
robustness with a Murmur-hash table!)
Result: [1]+[2] yields a 1.25x NOPM gain.
(Note: to avoid confusion with the data in [1], the ~1.8x improvement
reported there was measured on a different machine with 480 vCPUs.)
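To illustrate the idea in patch [2], here is a deliberately simplified
sketch; the prev-record link tracking, which the actual patch handles
with the hash table mentioned above, is omitted, and ReserveXLogBytes()
is a hypothetical helper rather than a function in the patch:

```c
#include "postgres.h"
#include "port/atomics.h"

/*
 * Simplified sketch of lock-free WAL space reservation.  Upstream
 * ReserveXLogInsertLocation() advances CurrBytePos and records
 * PrevBytePos under Insert->insertpos_lck; the patched path reserves
 * space with a single atomic fetch-add instead.
 */
static uint64
ReserveXLogBytes(pg_atomic_uint64 *CurrBytePos, uint64 size)
{
	/* Returns the old insert position, i.e. where our record starts. */
	return pg_atomic_fetch_add_u64(CurrBytePos, size);
}
```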
Step 2 & 3: Optimize LWLock Scalability
Patch [3]: Merge LWLock shared-state updates into a single atomic
add (replacing read-modify-write loops). This reduces cache coherence
overhead under contention.
Result: [1]+[2]+[3] yields a 1.52x NOPM gain.
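A rough sketch of what the shared-mode fast path in patch [3] looks
like; the waiter handling and slow path of the actual patch are omitted:

```c
#include "postgres.h"
#include "storage/lwlock.h"

/* Values as defined privately in lwlock.c. */
#define LW_VAL_EXCLUSIVE	((uint32) 1 << 24)
#define LW_VAL_SHARED		1

/*
 * Hypothetical sketch of the single-atomic-add shared acquisition.
 * Upstream LWLockAttemptLock() retries a compare-and-exchange loop on
 * lock->state, which bounces the cache line under heavy read contention.
 * Here a reader optimistically adds its shared reference with one
 * fetch-add and backs out only if an exclusive holder was present.
 */
static bool
LWLockAttemptSharedFastPath(LWLock *lock)
{
	uint32		old_state;

	old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);

	if ((old_state & LW_VAL_EXCLUSIVE) == 0)
		return true;			/* acquired in shared mode */

	/* Exclusive holder present: undo the speculative reference. */
	pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
	return false;
}
```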
Patch [4]: Introduce ReadBiasedLWLock for heavily shared locks
(e.g., ProcArrayLock). It partitions the reader lock state across 16
cache lines, mitigating atomic contention among readers.
Result: [1]+[2]+[3]+[4] yields a 2.10x NOPM gain.
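A structural sketch of the ReadBiasedLWLock idea; names, sizes, and the
wait/undo logic are illustrative, not taken verbatim from patch [4]:

```c
#include "postgres.h"
#include "port/atomics.h"

/*
 * Sketch of a read-biased lock: reader reference counts are striped
 * across 16 cache lines so that concurrent readers on different cores
 * do not all hammer one atomic word.  A writer has to sweep every
 * stripe, so this only pays off for heavily read-biased locks such as
 * ProcArrayLock.
 */
#define READ_BIAS_STRIPES	16
#define CACHELINE_SIZE		64

typedef struct ReadBiasStripe
{
	pg_atomic_uint32 nreaders;	/* readers accounted on this stripe */
	char		pad[CACHELINE_SIZE - sizeof(pg_atomic_uint32)];
} ReadBiasStripe;

typedef struct ReadBiasedLWLock
{
	pg_atomic_uint32 writer_state;	/* exclusive-intent flag */
	ReadBiasStripe readers[READ_BIAS_STRIPES];
} ReadBiasedLWLock;

/* A reader touches only its own stripe (e.g. chosen by backend id). */
static void
ReadBiasedLockShared(ReadBiasedLWLock *lock, int my_stripe)
{
	pg_atomic_fetch_add_u32(&lock->readers[my_stripe].nreaders, 1);

	/*
	 * A real implementation must then re-check writer_state and wait
	 * (or undo the increment) if a writer is pending.
	 */
}
```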
---
### Overall Impact
With all patches applied, we observe:
- 2.06x NOPM improvement vs. upstream (384 vCPUs; HammerDB: 192 VUs, 757
warehouses).
- Accumulated gains for each optimization step (Fig 3)
- Enhanced performance scalability with core count (Fig 4)
---
### Figures & Patch Links
Fig 1: TPROC-C scalability regression (1 socket view)
Fig 2: offcputime flamegraph (pre-optimization)
Fig 3: Accumulated gains (full cores)
Fig 4: Accumulated gains vs core count (1 socket view)
[1] Increase NUM_XLOGINSERT_LOCKS:
https://www.postgresql.org/message-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447(at)postgrespro(dot)ru
[2] Lock-free XLog Reservation from WAL:
https://www.postgresql.org/message-id/flat/PH7PR11MB5796659F654F9BE983F3AD97EF142%40PH7PR11MB5796.namprd11.prod.outlook.com
[3] Optimize shared LWLock acquisition for high-core-count systems:
https://www.postgresql.org/message-id/flat/73d53acf-4f66-41df-b438-5c2e6115d4de%40intel.com
[4] Optimize LWLock scalability via ReadBiasedLWLock for heavily-shared
locks:
https://www.postgresql.org/message-id/e7d50174-fbf8-4a82-a4cd-1c4018595d1b@intel.com
Best regards,
Zhiguo
Attachment | Content-Type | Size
---|---|---
Fig1.png | image/png | 26.2 KB
Fig2.png | image/png | 261.4 KB
Fig3.png | image/png | 34.2 KB
Fig4.png | image/png | 77.9 KB