Re: Optimize shared LWLock acquisition for high-core-count systems

From: "Zhou, Zhiguo" <zhiguo(dot)zhou(at)intel(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Cc: <tianyou(dot)li(at)intel(dot)com>
Subject: Re: Optimize shared LWLock acquisition for high-core-count systems
Date: 2025-07-21 16:19:39
Message-ID: 72bb9bb2-9e48-4e27-a147-627b0658eddf@intel.com
Lists: pgsql-hackers

On 7/11/2025 4:35 PM, Yura Sokolov wrote:
> 10.07.2025 18:57, Zhou, Zhiguo writes:
>>
>>
>> On 7/9/2025 3:56 PM, Yura Sokolov wrote:
>>> 30.05.2025 14:30, Zhou, Zhiguo writes:
>>>> Hi Hackers,
>>>>
>>>> I am reaching out to solicit your insights and comments on this patch
>>>> addressing a significant performance bottleneck in LWLock acquisition
>>>> observed on high-core-count systems. During performance analysis of
>>>> HammerDB/TPCC (192 virtual users, 757 warehouses) on a 384-vCPU Intel
>>>> system, we found that LWLockAttemptLock consumed 7.12% of total CPU
>>>> cycles. This bottleneck becomes even more pronounced (up to 30% of
>>>> cycles) after applying lock-free WAL optimizations[1][2].
>>>>
>>>> Problem Analysis:
>>>> The current LWLock implementation uses separate atomic operations for
>>>> state checking and modification. For shared locks (84% of
>>>> LWLockAttemptLock calls), this requires:
>>>> 1. Atomic read of lock->state
>>>> 2. State modification
>>>> 3. Atomic compare-exchange (with retries on contention)
>>>>
>>>> This design causes excessive atomic operations on contended locks, which
>>>> are particularly expensive on high-core-count systems where cache-line
>>>> bouncing amplifies synchronization costs.
>>>>
>>>> Optimization Approach:
>>>> The patch optimizes shared lock acquisition by:
>>>> 1. Merging state read and update into a single atomic add operation
>>>> 2. Extending LW_SHARED_MASK by 1 bit and shifting LW_VAL_EXCLUSIVE
>>>> 3. Adding a willwait parameter to control optimization usage
>>>>
>>>> Key implementation details:
>>>> - For LW_SHARED with willwait=true: Uses atomic fetch-add to increment
>>>> reference count
>>>> - Maintains backward compatibility through state mask adjustments
>>>> - Preserves existing behavior for:
>>>> 1) Exclusive locks
>>>> 2) Non-waiting cases (LWLockConditionalAcquire)
>>>> - Bounds shared lock count to MAX_BACKENDS*2 (handled via mask extension)
>>>>
>>>> Performance Impact:
>>>> Testing on a 384-vCPU Intel system shows:
>>>> - *8%* NOPM improvement in HammerDB/TPCC with this optimization alone
>>>> - *46%* cumulative improvement when combined with lock-free WAL
>>>> optimizations[1][2]
>>>>
>>>> Patch Contents:
>>>> 1. Extends shared mask and shifts exclusive lock value
>>>> 2. Adds willwait parameter to control optimization
>>>> 3. Updates lock acquisition/release logic
>>>> 4. Maintains all existing assertions and safety checks
>>>>
>>>> The optimization is particularly effective for contended shared locks,
>>>> which are common in buffer mapping, lock manager, and shared buffer
>>>> access patterns.
>>>>
>>>> Please review this patch for consideration in upcoming PostgreSQL releases.
>>>>
>>>> [1] Lock-free XLog Reservation from WAL:
>>>> https://www.postgresql.org/message-id/flat/PH7PR11MB5796659F654F9BE983F3AD97EF142%40PH7PR11MB5796.namprd11.prod.outlook.com
>>>> [2] Increase NUM_XLOGINSERT_LOCKS:
>>>> https://www.postgresql.org/message-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
>>>
>>> Good day, Zhou.
>>>
>>> Could you explain, why your patch is correct?
>>>
>>> As far as I understand, it is clearly not correct at this time:
>>> - The SHARED lock count may be incremented many times, because of the
>>> for(;;) loop in LWLockAcquire and because LWLockAttemptLock is called
>>> twice per loop iteration when the lock is held in Exclusive mode by
>>> someone else.
>>>
>>> If your patch is correct, then where I'm wrong?
>>>
>>> When I tried to do the same thing, I did a sub_fetch immediately in case
>>> of acquisition failure, and did no add_fetch at all if the lock was held
>>> in Exclusive mode.
>>>

We rigorously tested your suggested approach of using a sub_fetch on
acquisition failure to avoid the temporary overcounting, with [1][2] as
the baseline. Performance results on our 384-vCPU Intel system running
HammerDB/TPCC (192 VU, 757 warehouses) showed an NOPM improvement of
8.36%, which is lower than the 20.81% brought by this optimization.
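
For concreteness, the rollback variant we benchmarked corresponds roughly
to the sketch below. The helper name is hypothetical; the pg_atomic_*
calls and LW_VAL_* constants follow src/backend/storage/lmgr/lwlock.c,
though the exact constant values differ once the patch shifts
LW_VAL_EXCLUSIVE:

    /* Sketch only: shared acquisition with rollback on failure. */
    static bool
    AttemptSharedWithRollback(LWLock *lock)
    {
        uint32  old_state;

        /* optimistically take a shared reference */
        old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);
        if (old_state & LW_VAL_EXCLUSIVE)
        {
            /* an exclusive holder is present: undo our reference, fail */
            pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
            return false;
        }
        return true;
    }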

>>> BTW, there's a way to optimize the EXCLUSIVE lock as well, since there's
>>> no need to do a CAS if the lock is held by someone else.
>>>
>>> See my version in attach...
>>>
>>
>> Good day, Yura.
>>
>> Thanks for your comments, which identify this critical safety condition
>> – you're absolutely right that we must guarantee the shared reference
>> count never overflows into the exclusive bit. Let me clarify the safety
>> mechanism:
>>
>> When a reader encounters an exclusive lock during acquisition
>> (triggering the for(;;) loop), it does temporarily increment the shared
>> count twice – once per LWLockAttemptLock attempt. However, both
>> increments occur before the reader waits on its semaphore
>> (PGSemaphoreLock(proc->sem)). Crucially, when the exclusive holder
>> releases the lock via LWLockReleaseInternal, it resets the entire lock
>> state (line 1883: pg_atomic_fetch_and_u32(&lock->state,
>> ~LW_LOCK_MASK)). This clears all reader references, including any
>> "overcounted" increments from blocked readers.
>
> I see my mistake now: I misread this pg_atomic_fetch_and as
> pg_atomic_fetch_add.
>
> Clever trick, but rather unintuitive. It is hard to reason about its
> safety. It has to be described in detail in the code comments and commit
> message, but you completely missed a description of this important nuance
> in the first patch version.
>

We've significantly expanded the code comments and commit message
(attached) to explicitly document:
- The transient "overcount" scenario when readers encounter exclusive locks
- The safety boundary (MAX_BACKENDS * 2 references)
- The critical reset mechanism in LWLockReleaseInternal where
pg_atomic_fetch_and_u32(&lock->state, ~LW_LOCK_MASK) clears all references
- Conditions under which the optimization activates (only when
willwait=true)
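
To make the documented behaviour concrete, the two paths involved look
roughly like this (a simplified sketch of the patched lwlock.c; willwait
and wait-queue handling are omitted):

    /* Shared acquisition becomes a single atomic add; if an exclusive
     * holder is present, the extra reference is left in place and the
     * reader goes on to wait: */
    old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);
    if (old_state & LW_VAL_EXCLUSIVE)
        return false;   /* reader will wait; reference cleared on release */

    /* Exclusive release then clears every bit under LW_LOCK_MASK,
     * discarding any transient reader "overcount": */
    oldstate = pg_atomic_fetch_and_u32(&lock->state, ~LW_LOCK_MASK);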

>> Thus, when blocked readers wake:
>>
>> 1. They retry acquisition on a zero-initialized state
>> 2. Each ultimately increments only once for successful acquisition
>> 3. The transient "overcount" (≤ MAX_BACKENDS × 2) stays safely within
>> LW_SHARED_MASK
>>
>> The key invariants are:
>>
>> - LW_SHARED_MASK = (MAX_BACKENDS << 1) + 1
>> - Exclusive release resets all shared bits
>> - Readers never persist >1 reference after wakeup
>>
>> Does this resolve the concern? I appreciate you flagging this subtlety –
>> please correct me if I've misunderstood your scenario or misinterpreted
>> the code.
>>
>> And thank you for sharing your implementation – I particularly
>> agree with your optimization for exclusive lock acquisition.
>> Avoiding the CAS loop when the lock is already held (by checking the
>> state early) is a clever reduction of atomic operations, which we know
>> are costly on high-core-count systems. I'll prioritize evaluating this
>> for our HammerDB/TPROC-C workload and share benchmark results soon.
>

While testing your exclusive lock optimization (avoiding the CAS when the
lock is already held), we observed performance on par with this
optimization in TPCC, as the workload isn't exclusive-lock intensive (see
the sketch after the list below). We recognize its potential value and
will:
- Find another, dedicated benchmark to quantify its gains
- Propose it as a separate optimization once validated
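
The exclusive-path idea we evaluated corresponds roughly to the following
sketch (our reading of your patch, not the exact code):

    /* Sketch: skip the compare-exchange entirely when the lock is
     * already held, saving one contended atomic RMW on the hot path. */
    uint32  old_state = pg_atomic_read_u32(&lock->state);

    if (old_state & LW_LOCK_MASK)
        return false;   /* already held; don't attempt the CAS */

    return pg_atomic_compare_exchange_u32(&lock->state, &old_state,
                                          old_state | LW_VAL_EXCLUSIVE);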

> This does the same for the shared lock: if the lock is held in exclusive
> mode, no fetch_add is performed. And a read before the fetch_add is not
> expensive, I believe. But I didn't test it on such a huge machine as
> yours, so I could be mistaken.
>

We also tried giving up early on acquiring the shared lock when it is
held in exclusive mode, by inserting an atomic_read before the fetch_add.
The performance test shows an improvement of 8.4%, which is lower than
the 20.81% brought by the fetch_add-only implementation, suggesting that
the additional atomic_read operation is also costly on high-core-count
systems.
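
The early-exit variant we measured is roughly the following (an
illustration of what we tested, not the code we propose to commit):

    /* Sketch: peek at the state and give up before issuing the atomic
     * RMW when an exclusive holder is present. */
    uint32  old_state = pg_atomic_read_u32(&lock->state);

    if (old_state & LW_VAL_EXCLUSIVE)
        return false;   /* held exclusively; give up early */

    old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);
    return (old_state & LW_VAL_EXCLUSIVE) == 0;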

>> Regarding shared locks: Your version (using sub_fetch on acquisition
>> failure) does align more cleanly with the original state machine by
>> avoiding transient overcounts. I initially came up with a similar
>> approach but shifted to the single-atomic-increment design to minimize
>> atomic instructions – a critical priority for our 384-core benchmarks
>> where atomic ops dominate contention.
>>
>> Let’s reconcile these strengths:
>>
>> 1. I’ll test your patch head-to-head against our current version in HCC
>> TPROC-C workloads.
>> 2. If the atomic savings in your exclusive path yield meaningful gains,
>> we will try to integrate it into our patch immediately.
>> 3. For shared locks: if your design shows comparable performance while
>> simplifying correctness, it’s a compelling option.
>>
>> Really appreciate you driving this optimization further!
>
> I appreciate your work too!
> Scalability has long been starving for increased attention.
>
Good day, Yura!

Thank you for your thoughtful review and critical feedback on the LWLock
optimization patch. Your insights into the potential shared count
overflow issue were especially valuable and prompted us to thoroughly
re-examine the safety mechanisms.

We've preserved the original optimization due to its demonstrated
performance advantages but incorporated your critical documentation
enhancements. The revised patch now includes:

- Comprehensive safety explanations in lwlock.c comments
- Detailed commit message describing transient states

We believe this addresses the correctness concerns of our readers while
delivering significant scalability gains. We welcome further scrutiny
and would appreciate your thoughts on the revised documentation.

Thank you again for your expertise in strengthening this optimization!

Regards,
Zhiguo

[1] Lock-free XLog Reservation from WAL:
https://www.postgresql.org/message-id/flat/PH7PR11MB5796659F654F9BE983F3AD97EF142%40PH7PR11MB5796.namprd11.prod.outlook.com
[2] Increase NUM_XLOGINSERT_LOCKS:
https://www.postgresql.org/message-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru

Attachment: v1-0001-Optimize-shared-LWLock-acquisition-for-high-core-cou.patch (text/plain, 10.3 KB)
