Re: Optimize shared LWLock acquisition for high-core-count systems

From: "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Cc: <tianyou(dot)li(at)intel(dot)com>
Subject: Re: Optimize shared LWLock acquisition for high-core-count systems
Date: 2025-07-10 15:57:01
Message-ID: 1a3f09c5-3709-4d84-a706-113d9a72cd74@intel.com

On 7/9/2025 3:56 PM, Yura Sokolov wrote:
> 30.05.2025 14:30, Zhou, Zhiguo wrote:
>> Hi Hackers,
>>
>> I am reaching out to solicit your insights and comments on this patch
>> addressing a significant performance bottleneck in LWLock acquisition
>> observed on high-core-count systems. During performance analysis of
>> HammerDB/TPCC (192 virtual users, 757 warehouses) on a 384-vCPU Intel
>> system, we found that LWLockAttemptLock consumed 7.12% of total CPU
>> cycles. This bottleneck becomes even more pronounced (up to 30% of
>> cycles) after applying lock-free WAL optimizations[1][2].
>>
>> Problem Analysis:
>> The current LWLock implementation uses separate atomic operations for
>> state checking and modification. For shared locks (84% of
>> LWLockAttemptLock calls), this requires:
>> 1. Atomic read of lock->state
>> 2. State modification
>> 3. Atomic compare-exchange (with retries on contention)
>>
>> This design causes excessive atomic operations on contended locks, which
>> are particularly expensive on high-core-count systems where cache-line
>> bouncing amplifies synchronization costs.
>>
>> Optimization Approach:
>> The patch optimizes shared lock acquisition by:
>> 1. Merging state read and update into a single atomic add operation
>> 2. Extending LW_SHARED_MASK by 1 bit and shifting LW_VAL_EXCLUSIVE
>> 3. Adding a willwait parameter to control optimization usage
>>
>> Key implementation details:
>> - For LW_SHARED with willwait=true: Uses atomic fetch-add to increment
>> reference count
>> - Maintains backward compatibility through state mask adjustments
>> - Preserves existing behavior for:
>> 1) Exclusive locks
>> 2) Non-waiting cases (LWLockConditionalAcquire)
>> - Bounds shared lock count to MAX_BACKENDS*2 (handled via mask extension)
>>
>> Performance Impact:
>> Testing on a 384-vCPU Intel system shows:
>> - *8%* NOPM improvement in HammerDB/TPCC with this optimization alone
>> - *46%* cumulative improvement when combined with lock-free WAL
>> optimizations[1][2]
>>
>> Patch Contents:
>> 1. Extends shared mask and shifts exclusive lock value
>> 2. Adds willwait parameter to control optimization
>> 3. Updates lock acquisition/release logic
>> 4. Maintains all existing assertions and safety checks
>>
>> The optimization is particularly effective for contended shared locks,
>> which are common in buffer mapping, lock manager, and shared buffer
>> access patterns.
>>
>> Please review this patch for consideration in upcoming PostgreSQL releases.
>>
>> [1] Lock-free XLog Reservation from WAL:
>> https://www.postgresql.org/message-id/flat/PH7PR11MB5796659F654F9BE983F3AD97EF142%40PH7PR11MB5796.namprd11.prod.outlook.com
>> [2] Increase NUM_XLOGINSERT_LOCKS:
>> https://www.postgresql.org/message-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
>
> Good day, Zhou.
>
> Could you explain why your patch is correct?
>
> As far as I understand, it is clearly not correct at this time:
> - SHARED lock count may be incremented many times, because of the for(;;)
> loop in LWLockAcquire, and because LWLockAttemptLock is called twice per
> loop iteration when the lock is held in Exclusive mode by someone else.
>
> If your patch is correct, then where I'm wrong?
>
> When I tried to do the same thing, I did sub_fetch immediately on
> acquisition failure, and did no add_fetch at all if the lock is held in
> Exclusive mode.
>
> BTW, there's a way to optimize the EXCLUSIVE lock as well, since there's
> no need to do a CAS if the lock is held by someone else.
>
> See my version in attach...
>

Good day, Yura.

Thanks for your comments identifying this critical safety condition –
you're absolutely right that we must guarantee that the shared reference
count never overflows into the exclusive bit. Let me clarify the safety
mechanism:

When a reader encounters an exclusive lock during acquisition
(triggering the for(;;) loop), it does temporarily increment the shared
count twice – once per LWLockAttemptLock attempt. However, both
increments occur before the reader waits on its semaphore
(PGSemaphoreLock(proc->sem)). Crucially, when the exclusive holder
releases the lock via LWLockReleaseInternal, it resets the entire lock
state (line 1883: pg_atomic_fetch_and_u32(&lock->state,
~LW_LOCK_MASK)). This clears all reader references, including any
"overcounted" increments from blocked readers.

Thus, when blocked readers wake:

1. They retry acquisition on a zero-initialized state
2. Each ultimately increments the count only once on successful
acquisition
3. The transient "overcount" (≤ MAX_BACKENDS × 2) stays safely within
LW_SHARED_MASK

The key invariants are:

- LW_SHARED_MASK = (MAX_BACKENDS << 1) + 1
- Exclusive release resets all shared bits
- Readers never persist >1 reference after wakeup
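
In code terms, those invariants come down to something like the
following sketch (the mask layout follows the invariants above; the
helper name is hypothetical):

/* Shared mask widened by one bit so a transient overcount of up to
 * MAX_BACKENDS * 2 still fits below the exclusive bit: */
#define LW_SHARED_MASK      ((MAX_BACKENDS << 1) + 1)
#define LW_VAL_EXCLUSIVE    (LW_SHARED_MASK + 1)    /* shifted up one bit */
#define LW_LOCK_MASK        (LW_SHARED_MASK | LW_VAL_EXCLUSIVE)

/*
 * Exclusive release clears the whole state word in one atomic AND,
 * discarding any reader overcounts accumulated while the lock was held.
 * (Hypothetical helper; in the patch this happens in
 * LWLockReleaseInternal.)
 */
static void
LWLockClearStateOnExclusiveRelease(LWLock *lock)
{
    pg_atomic_fetch_and_u32(&lock->state, ~LW_LOCK_MASK);
}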

Does this resolve the concern? I appreciate you flagging this subtlety –
please correct me if I've misunderstood your scenario or misinterpreted
the code.

And thank you for sharing your implementation – I particularly agree
with your optimization for exclusive lock acquisition. Avoiding the CAS
loop when the lock is already held (by checking the state early) is a
clever reduction of atomic operations, which we know are costly on
high-core-count systems. I’ll prioritize evaluating this on our
HammerDB/TPROC-C workload and share benchmark results soon.
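
If I've read your version correctly, the exclusive path looks roughly
like this (my paraphrase as a sketch, not your patch text):

static bool
LWLockAttemptLockExclusive(LWLock *lock)
{
    uint32      oldstate = pg_atomic_read_u32(&lock->state);

    /* Already held, shared or exclusive: skip the CAS entirely. */
    if (oldstate & LW_LOCK_MASK)
        return false;

    /* Looked free: attempt the CAS; a concurrent change makes it fail. */
    return pg_atomic_compare_exchange_u32(&lock->state, &oldstate,
                                          oldstate | LW_VAL_EXCLUSIVE);
}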

Regarding shared locks: your version (using sub_fetch on acquisition
failure) does align more cleanly with the original state machine by
avoiding transient overcounts. I initially came up with a similar
approach but shifted to the single-atomic-increment design to minimize
atomic instructions – a critical priority for our 384-core benchmarks
where atomic ops dominate contention.
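
For comparison, this is how I understand your shared path – again a
sketch under my reading of the patch, with a function name of my own:

static bool
LWLockAttemptLockSharedRollback(LWLock *lock)
{
    uint32      oldstate = pg_atomic_read_u32(&lock->state);

    /* Held exclusively: don't touch the counter at all (no add_fetch). */
    if (oldstate & LW_VAL_EXCLUSIVE)
        return false;

    oldstate = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);
    if ((oldstate & LW_VAL_EXCLUSIVE) == 0)
        return true;

    /*
     * Raced with an exclusive acquirer: roll the increment back
     * immediately, so no overcount ever persists in the state word.
     */
    pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
    return false;
}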

Let’s reconcile these strengths:

1. I’ll test your patch head-to-head against our current version in HCC
TPROC-C workloads.
2. If the atomic savings in your exclusive path yield meaningful gains,
we will try to integrate it into our patch immediately.
3. For shared locks: if your design shows comparable performance while
simplifying correctness, it’s a compelling option.

Really appreciate you driving this optimization further!

Regards,
Zhiguo
