Re: Fix performance degradation of contended LWLock on NUMA

From: Sokolov Yura <funny(dot)falcon(at)postgrespro(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: jesper(dot)pedersen(at)redhat(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fix performance degradation of contended LWLock on NUMA
Date: 2017-10-19 11:36:56
Message-ID: 70f8f2b672db8e42b1f5d524e201acb0@postgrespro.ru
Lists: pgsql-hackers

Hi,

On 2017-10-19 03:03, Andres Freund wrote:
> Hi,
>
> On 2017-09-08 22:35:39 +0300, Sokolov Yura wrote:
>>  /*
>>   * Internal function that tries to atomically acquire the lwlock in the passed
>> - * in mode.
>> + * in mode. If it could not grab the lock, it doesn't puts proc into wait
>> + * queue.
>>   *
>> - * This function will not block waiting for a lock to become free - that's the
>> - * callers job.
>> + * It is used only in LWLockConditionalAcquire.
>>   *
>> - * Returns true if the lock isn't free and we need to wait.
>> + * Returns true if the lock isn't free.
>>   */
>>  static bool
>> -LWLockAttemptLock(LWLock *lock, LWLockMode mode)
>> +LWLockAttemptLockOnce(LWLock *lock, LWLockMode mode)
>
> This now has become a fairly special cased function, I'm not convinced
> it makes much sense with the current naming and functionality.

It simply doesn't have to be as complex as LWLockAttemptLockOrQueueSelf.
Their functionality is too different.

>> +/*
>> + * Internal function that tries to atomically acquire the lwlock in the passed
>> + * in mode or put it self into waiting queue with waitmode.
>> + * This function will not block waiting for a lock to become free - that's the
>> + * callers job.
>> + *
>> + * Returns true if the lock isn't free and we are in a wait queue.
>> + */
>> +static inline bool
>> +LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode, LWLockMode waitmode)
>> +{
>> +    uint32      old_state;
>> +    SpinDelayStatus delayStatus;
>> +    bool        lock_free;
>> +    uint32      mask,
>> +                add;
>> +
>> +    AssertArg(mode == LW_EXCLUSIVE || mode == LW_SHARED);
>> +
>> +    if (mode == LW_EXCLUSIVE)
>> +    {
>> +        mask = LW_LOCK_MASK;
>> +        add = LW_VAL_EXCLUSIVE;
>> +    }
>> +    else
>> +    {
>> +        mask = LW_VAL_EXCLUSIVE;
>> +        add = LW_VAL_SHARED;
>> +    }
>> +
>> +    init_local_spin_delay(&delayStatus);
>
> The way you moved this around has the disadvantage that we now do this -
> a number of writes - even in the very common case where the lwlock can
> be acquired directly.

Excuse me, I don't quite understand.
Are you objecting to init_local_spin_delay being placed here?
Placing it anywhere else would complicate the code.

Or are you objecting to setting `mask` and `add`?
At least it simplifies the loop body. For example, it is simpler to
write specialized Power assembly using the already-filled `mask` and
`add` registers (I have a working draft with Power assembly).
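
To illustrate, here is a minimal sketch of the loop body inside
LWLockAttemptLockOrQueueSelf that the precomputed `mask` and `add`
enable (simplified: the queueing path and flag handling are omitted,
and the actual patch differs in details):

    while (true)
    {
        old_state = pg_atomic_read_u32(&lock->state);

        if ((old_state & mask) == 0)
        {
            /* lock looks free for our mode, try to grab it */
            if (pg_atomic_compare_exchange_u32(&lock->state,
                                               &old_state,
                                               old_state + add))
                return false;   /* got the lock */
        }
        else
            perform_spin_delay(&delayStatus);
    }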

In either case, I think the simpler version should be accepted first.
It acts as the definition of the algorithm, and it already gives a
measurable improvement. After some testing, an attempt to improve it
may be made (assuming the release will be next autumn, there is enough
time for testing and improvements).
I could be mistaken.

>> +    /*
>> +     * We intentionally do not call finish_spin_delay here, because the loop
>> +     * above usually finished by queuing into the wait list on contention, and
>> +     * doesn't reach spins_per_delay thereby doesn't sleep inside of
>> +     * perform_spin_delay. Also, different LWLocks has very different
>> +     * contention pattern, and it is wrong to update spin-lock statistic based
>> +     * on LWLock contention.
>> +     */
>
> Huh? This seems entirely unconvincing. Without adjusting this here we'll
> just spin the same way every iteration. Especially for the case where
> somebody else holds LW_FLAG_LOCKED that's quite bad.

LWLocks are very different. Some of them are always short-term
(BufferLock), others are always locked for a long time. Some are
sparsely contended, others are crowded. It is quite bad to tune the
single variable `spins_per_delay` under such different loads.

And it is already tuned by spin-locks, and the spin-locks tune it quite
well.

I've tried to place this delay into the lock itself (it has 2 free
bytes), but that attempt performed worse.
Now I understand that the delays should be stored in an array indexed
by tranche. But I have no time to test this idea, and I doubt it will
give radically better results (i.e. > 5%), so I think it is better to
accept the patch in this form, and then experiment with per-tranche
delay.

>
>> From e5b13550fc48d62b0b855bedd7fcd5848b806b09 Mon Sep 17 00:00:00 2001
>> From: Sokolov Yura <funny(dot)falcon(at)postgrespro(dot)ru>
>> Date: Tue, 30 May 2017 18:54:25 +0300
>> Subject: [PATCH 5/6] Fix implementation description in a lwlock.c .
>>
>> ---
>> src/backend/storage/lmgr/lwlock.c | 17 ++++++++---------
>> 1 file changed, 8 insertions(+), 9 deletions(-)
>>
>> diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
>> index 334c2a2d96..0a41c2c4e2 100644
>> --- a/src/backend/storage/lmgr/lwlock.c
>> +++ b/src/backend/storage/lmgr/lwlock.c
>> @@ -62,16 +62,15 @@
>>   * notice that we have to wait. Unfortunately by the time we have finished
>>   * queuing, the former locker very well might have already finished it's
>>   * work. That's problematic because we're now stuck waiting inside the OS.
>> -
>> - * To mitigate those races we use a two phased attempt at locking:
>> - *   Phase 1: Try to do it atomically, if we succeed, nice
>> - *   Phase 2: Add ourselves to the waitqueue of the lock
>> - *   Phase 3: Try to grab the lock again, if we succeed, remove ourselves from
>> - *            the queue
>> - *   Phase 4: Sleep till wake-up, goto Phase 1
>>   *
>> - * This protects us against the problem from above as nobody can release too
>> - *   quick, before we're queued, since after Phase 2 we're already queued.
>> + * This race is avoided by taking a lock for the wait list using CAS with the old
>> + * state value, so it would only succeed if lock is still held. Necessary flag
>> + * is set using the same CAS.
>> + *
>> + * Inside LWLockConflictsWithVar the wait list lock is reused to protect the variable
>> + * value. First the lock is acquired to check the variable value, then flags are
>> + * set with a second CAS before queuing to the wait list in order to ensure
>> + * that the lock was not released yet.
>>   *
>>   * -------------------------------------------------------------------------
>>   */
>
> I think this needs more extensive surgery.

Again, I don't understand what you mean; it seems my English is not
very fluent :-(
Do you mean there should be a more thorough description? Is the
description from my earlier letter good enough to be copied into this
comment?

>> From cc74a849a64e331930a2285e15445d7f08b54169 Mon Sep 17 00:00:00 2001
>> From: Sokolov Yura <funny(dot)falcon(at)postgrespro(dot)ru>
>> Date: Fri, 2 Jun 2017 11:34:23 +0000
>> Subject: [PATCH 6/6] Make SpinDelayStatus a bit lighter.
>>
>> It saves couple of instruction of fast path of spin-loop, and so makes
>> fast path of LWLock a bit faster (and in other places where spinlock is
>> used).
>
>> Also it makes call to perform_spin_delay a bit slower, that could
>> positively affect on spin behavior, especially if no `pause` instruction
>> present.
>
> Whaa? That seems pretty absurd reasoning.

:-) AMD Opteron doesn't respect the "pause" instruction at all (as
noted in s_lock.h, and mentioned several times on the Internet).
Given that the whole reason for `perform_spin_delay` is to spend time
somewhere, making it a tiny bit slower may only improve things :-)

On 2017-10-19 02:28, Andres Freund wrote:
> On 2017-06-05 16:22:58 +0300, Sokolov Yura wrote:
>> Algorithm for LWLockWaitForVar is also refactored. New version is:
>> 1. If lock is not held by anyone, it immediately exits.
>> 2. Otherwise it is checked for ability to take WaitList lock, because
>> the variable is protected with it. If so, CAS is performed, and if it
>> is successful, the loop breaks to step 4.
>> 3. Otherwise spin_delay is performed, and the loop returns to step 1.
>> 4. var is checked for change. If it was changed, we unlock the wait
>> list lock and exit.
>> Note: it could not change in the following steps because we are
>> holding the wait list lock.
>> 5. Otherwise CAS on setting the necessary flags is performed. If it
>> succeeds, then queue Proc to the wait list and exit.
>> 6. If CAS failed, then there is a possibility for the LWLock to be
>> already released - if so then we should unlock the wait list and exit.
>> 7. Otherwise the loop returns to step 5.
>>
>> So the invariant is preserved:
>> - either we found the LWLock free,
>> - or we found the variable changed,
>> - or we set flags and queued ourselves while the LWLock was held.
>>
>> Spin_delay is not performed at step 7, because we should release the
>> wait list lock as soon as possible.
>
> That seems unconvincing - by not delaying you're more likely to
> *increase* the time till the current locker that holds the lock can
> release the lock.

But why? If our CAS wasn't successful, then someone else's CAS was, so
there is always overall progress. And if we were to sleep in this loop,
other waiters would spin in the first loop of this function. But given
the concrete usage of LWLockWaitForVar, it is probably not too bad to
hold other waiters in the first loop of this function (they loop until
we release the wait list lock, or the lock is released entirely).

I'm in doubt about which is better. May I keep it unchanged for now?

BTW, I found a small mistake in this place: I forgot to set
LW_FLAG_LOCKED in the expected state before this CAS. It looks like it
wasn't a real error, because the CAS always failed on the first loop
iteration (the real `lock->state` already had LW_FLAG_LOCKED set), and
after a failed CAS the expected state absorbs the value from
`lock->state`. I'll fix it.
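
In other words, something like this for steps 5-7 (a minimal sketch of
how I understand the fixed loop; the actual patch may differ in
details):

    while (true)
    {
        uint32  desired_state;

        /* the missing line: we still hold the wait list lock here, so
         * the expected old value must have LW_FLAG_LOCKED set */
        old_state |= LW_FLAG_LOCKED;

        desired_state = old_state | LW_FLAG_HAS_WAITERS;
        if (pg_atomic_compare_exchange_u32(&lock->state,
                                           &old_state, desired_state))
            break;          /* flags are set: queue self and exit */

        if ((old_state & LW_LOCK_MASK) == 0)
            break;          /* lock was released meanwhile: unlock
                             * the wait list and exit */
    }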

> One big advantage of this is that we should be able to make LW_SHARED
> acquisition an xadd. I'd done that previously, when the separate
> spinlock still existed, and it was quite beneficial for performance on
> larger systems. But it was some fiddly code - should be easier now.
> Requires some careful management when noticing that an xadd acquired a
> shared lock even though there's a conflicting exclusive lock, but
> increase the fastpath.
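
For the record, my understanding of the xadd idea is roughly this (a
rough sketch; the "careful management" on conflict is exactly the
fiddly part that is not shown):

    /* optimistic shared acquisition with a single atomic add */
    old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);
    if ((old_state & LW_VAL_EXCLUSIVE) == 0)
        return true;        /* fast path: got the shared lock */

    /* a conflicting exclusive lock is held: undo our increment and
     * fall back to the slow path */
    pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
    return false;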

--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
