Re: Optimize shared LWLock acquisition for high-core-count systems

From: "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Cc: <tianyou(dot)li(at)intel(dot)com>
Subject: Re: Optimize shared LWLock acquisition for high-core-count systems
Date: 2025-07-10 15:57:01
Message-ID: 1a3f09c5-3709-4d84-a706-113d9a72cd74@intel.com

On 7/9/2025 3:56 PM, Yura Sokolov wrote:
> 30.05.2025 14:30, Zhou, Zhiguo wrote:
>> Hi Hackers,
>>
>> I am reaching out to solicit your insights and comments on this patch
>> addressing a significant performance bottleneck in LWLock acquisition
>> observed on high-core-count systems. During performance analysis of
>> HammerDB/TPCC (192 virtual users, 757 warehouses) on a 384-vCPU Intel
>> system, we found that LWLockAttemptLock consumed 7.12% of total CPU
>> cycles. This bottleneck becomes even more pronounced (up to 30% of
>> cycles) after applying lock-free WAL optimizations[1][2].
>>
>> Problem Analysis:
>> The current LWLock implementation uses separate atomic operations for
>> state checking and modification. For shared locks (84% of
>> LWLockAttemptLock calls), this requires:
>> 1. Atomic read of lock->state
>> 2. State modification
>> 3. Atomic compare-exchange (with retries on contention)
>>
>> This design causes excessive atomic operations on contended locks, which
>> are particularly expensive on high-core-count systems where cache-line
>> bouncing amplifies synchronization costs.
>>
>> Optimization Approach:
>> The patch optimizes shared lock acquisition by:
>> 1. Merging state read and update into a single atomic add operation
>> 2. Extending LW_SHARED_MASK by 1 bit and shifting LW_VAL_EXCLUSIVE
>> 3. Adding a willwait parameter to control optimization usage
>>
>> Key implementation details:
>> - For LW_SHARED with willwait=true: Uses atomic fetch-add to increment
>> reference count
>> - Maintains backward compatibility through state mask adjustments
>> - Preserves existing behavior for:
>> 1) Exclusive locks
>> 2) Non-waiting cases (LWLockConditionalAcquire)
>> - Bounds shared lock count to MAX_BACKENDS*2 (handled via mask extension)
>>
>> Performance Impact:
>> Testing on a 384-vCPU Intel system shows:
>> - *8%* NOPM improvement in HammerDB/TPCC with this optimization alone
>> - *46%* cumulative improvement when combined with lock-free WAL
>> optimizations[1][2]
>>
>> Patch Contents:
>> 1. Extends shared mask and shifts exclusive lock value
>> 2. Adds willwait parameter to control optimization
>> 3. Updates lock acquisition/release logic
>> 4. Maintains all existing assertions and safety checks
>>
>> The optimization is particularly effective for contended shared locks,
>> which are common in buffer mapping, lock manager, and shared buffer
>> access patterns.
>>
>> Please review this patch for consideration in upcoming PostgreSQL releases.
>>
>> [1] Lock-free XLog Reservation from WAL:
>> https://www.postgresql.org/message-id/flat/PH7PR11MB5796659F654F9BE983F3AD97EF142%40PH7PR11MB5796.namprd11.prod.outlook.com
>> [2] Increase NUM_XLOGINSERT_LOCKS:
>> https://www.postgresql.org/message-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
>
> Good day, Zhou.
>
> Could you explain why your patch is correct?
>
> As far as I understand, it is clearly not correct at this time:
> - SHARED lock count may be incremented many times, because of the for(;;)
> loop in LWLockAcquire, and because LWLockAttemptLock is called twice per
> loop iteration when the lock is held in Exclusive mode by someone else.
>
> If your patch is correct, then where I'm wrong?
>
> When I tried to do the same thing, I did sub_fetch immediately on
> acquisition failure, and did no add_fetch at all if the lock is held in
> Exclusive mode.
>
> BTW, there's a way to optimize the EXCLUSIVE lock as well, since there's
> no need to do a CAS if the lock is held by someone else.
>
> See my version in attach...
>

Good day, Yura.

Thanks for your comments identifying this critical safety condition –
you're absolutely right that we must guarantee that the shared reference
count never overflows into the exclusive bit. Let me clarify the safety
mechanism:

When a reader encounters an exclusive lock during acquisition
(triggering the for(;;) loop), it does temporarily increment the shared
count twice – once per LWLockAttemptLock attempt. However, both
increments occur before the reader waits on its semaphore
(PGSemaphoreLock(proc->sem)). Crucially, when the exclusive holder
releases the lock via LWLockReleaseInternal, it resets the entire lock
state (line 1883: pg_atomic_fetch_and_u32(&lock->state,
~LW_LOCK_MASK)). This clears all reader references, including any
"overcounted" increments from blocked readers.

Thus, when blocked readers wake:

1. They retry acquisition on a zero-initialized state
2. Each ultimately increments the count only once on successful
acquisition
3. The transient "overcount" (≤ MAX_BACKENDS × 2) stays safely within
LW_SHARED_MASK

The key invariants are:

- LW_SHARED_MASK = (MAX_BACKENDS << 1) + 1
- Exclusive release resets all shared bits
- Readers never persist >1 reference after wakeup
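
In code terms, those invariants come down to something like the
following sketch (the mask layout follows the invariants above; the
helper name is hypothetical):

/* Shared mask widened by one bit so a transient overcount of up to
 * MAX_BACKENDS * 2 still fits below the exclusive bit: */
#define LW_SHARED_MASK      ((MAX_BACKENDS << 1) + 1)
#define LW_VAL_EXCLUSIVE    (LW_SHARED_MASK + 1)    /* shifted up one bit */
#define LW_LOCK_MASK        (LW_SHARED_MASK | LW_VAL_EXCLUSIVE)

/*
 * Exclusive release clears the whole state word in one atomic AND,
 * discarding any reader overcounts accumulated while the lock was held.
 * (Hypothetical helper; in the patch this happens in
 * LWLockReleaseInternal.)
 */
static void
LWLockClearStateOnExclusiveRelease(LWLock *lock)
{
    pg_atomic_fetch_and_u32(&lock->state, ~LW_LOCK_MASK);
}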

Does this resolve the concern? I appreciate you flagging this subtlety –
please correct me if I've misunderstood your scenario or misinterpreted
the code.

And thank you for sharing your implementation – I particularly agree
with your optimization for exclusive lock acquisition. Avoiding the CAS
loop when the lock is already held (by checking the state early) is a
clever reduction of atomic operations, which we know are costly on
high-core-count systems. I’ll prioritize evaluating this on our
HammerDB/TPROC-C workload and share benchmark results soon.
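
If I've read your version correctly, the exclusive path looks roughly
like this (my paraphrase as a sketch, not your patch text):

static bool
LWLockAttemptLockExclusive(LWLock *lock)
{
    uint32      oldstate = pg_atomic_read_u32(&lock->state);

    /* Already held, shared or exclusive: skip the CAS entirely. */
    if (oldstate & LW_LOCK_MASK)
        return false;

    /* Looked free: attempt the CAS; a concurrent change makes it fail. */
    return pg_atomic_compare_exchange_u32(&lock->state, &oldstate,
                                          oldstate | LW_VAL_EXCLUSIVE);
}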

Regarding shared locks: your version (using sub_fetch on acquisition
failure) does align more cleanly with the original state machine by
avoiding transient overcounts. I initially came up with a similar
approach but shifted to the single-atomic-increment design to minimize
atomic instructions – a critical priority for our 384-core benchmarks
where atomic ops dominate contention.
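
For comparison, this is how I understand your shared path – again a
sketch under my reading of the patch, with a function name of my own:

static bool
LWLockAttemptLockSharedRollback(LWLock *lock)
{
    uint32      oldstate = pg_atomic_read_u32(&lock->state);

    /* Held exclusively: don't touch the counter at all (no add_fetch). */
    if (oldstate & LW_VAL_EXCLUSIVE)
        return false;

    oldstate = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);
    if ((oldstate & LW_VAL_EXCLUSIVE) == 0)
        return true;

    /*
     * Raced with an exclusive acquirer: roll the increment back
     * immediately, so no overcount ever persists in the state word.
     */
    pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
    return false;
}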

Let’s reconcile these strengths:

1. I’ll test your patch head-to-head against our current version in HCC
TPROC-C workloads.
2. If the atomic savings in your exclusive path yield meaningful gains,
we will try to integrate it into our patch immediately.
3. For shared locks: if your design shows comparable performance while
simplifying correctness, it’s a compelling option.

Really appreciate you driving this optimization further!

Regards,
Zhiguo
