Re: Fix performance degradation of contended LWLock on NUMA

From: Andres Freund <andres(at)anarazel(dot)de>
To: Sokolov Yura <funny(dot)falcon(at)postgrespro(dot)ru>
Cc: jesper(dot)pedersen(at)redhat(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Fix performance degradation of contended LWLock on NUMA
Date: 2017-10-19 00:03:34
Message-ID: 20171019000334.r7f6p4ypkki3x5tq@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2017-09-08 22:35:39 +0300, Sokolov Yura wrote:
> From 722a8bed17f82738135264212dde7170b8c0f397 Mon Sep 17 00:00:00 2001
> From: Sokolov Yura <funny(dot)falcon(at)postgrespro(dot)ru>
> Date: Mon, 29 May 2017 09:25:41 +0000
> Subject: [PATCH 1/6] More correct use of spinlock inside LWLockWaitListLock.
>
> SpinDelayStatus should be function-global so that iterations are counted
> correctly and the delays produced are more accurate.
>
> Also, if the spin loop didn't actually spin, don't count it towards the
> spins_per_delay correction. That wasn't necessary before because
> delayStatus was used only in contended cases.

I'm not convinced this is beneficial. It adds a bit of overhead to the
case where LW_FLAG_LOCKED can be set immediately. OTOH, it's not super
likely to make a large difference if the lock has to be taken anyway...

> +
> +/* just a shortcut to avoid declaring an lwlock_stats variable at the end of the function */
> +static void
> +add_lwlock_stats_spin_stat(LWLock *lock, SpinDelayStatus *delayStatus)
> +{
> + lwlock_stats *lwstats;
> +
> + lwstats = get_lwlock_stats_entry(lock);
> + lwstats->spin_delay_count += delayStatus->delays;
> +}
> #endif /* LWLOCK_STATS */

This seems like a pretty independent change.

> /*
> * Internal function that tries to atomically acquire the lwlock in the passed
> - * in mode.
> + * in mode. If it cannot grab the lock, it does not put the proc into the
> + * wait queue.
> *
> - * This function will not block waiting for a lock to become free - that's the
> - * callers job.
> + * It is used only in LWLockConditionalAcquire.
> *
> - * Returns true if the lock isn't free and we need to wait.
> + * Returns true if the lock isn't free.
> */
> static bool
> -LWLockAttemptLock(LWLock *lock, LWLockMode mode)
> +LWLockAttemptLockOnce(LWLock *lock, LWLockMode mode)

This has now become a fairly special-cased function; I'm not convinced
it makes much sense with the current naming and functionality.

> +/*
> + * Internal function that tries to atomically acquire the lwlock in the passed
> + * in mode, or put itself into the wait queue with waitmode.
> + * This function will not block waiting for a lock to become free - that's the
> + * callers job.
> + *
> + * Returns true if the lock isn't free and we are in a wait queue.
> + */
> +static inline bool
> +LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode, LWLockMode waitmode)
> +{
> + uint32 old_state;
> + SpinDelayStatus delayStatus;
> + bool lock_free;
> + uint32 mask,
> + add;
> +
> + AssertArg(mode == LW_EXCLUSIVE || mode == LW_SHARED);
> +
> + if (mode == LW_EXCLUSIVE)
> + {
> + mask = LW_LOCK_MASK;
> + add = LW_VAL_EXCLUSIVE;
> + }
> + else
> + {
> + mask = LW_VAL_EXCLUSIVE;
> + add = LW_VAL_SHARED;
> + }
> +
> + init_local_spin_delay(&delayStatus);

The way you moved this around has the disadvantage that we now do this -
a number of writes - even in the very common case where the lwlock can
be acquired directly.
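
One way to avoid that would be to defer the SpinDelayStatus setup until
we actually have to spin. Untested sketch, written as if it sat in
lwlock.c (so LW_FLAG_LOCKED and the s_lock.h spin-delay helpers are in
scope; the function name is made up), shown on the wait-list lock for
brevity:

static void
wait_list_lock_sketch(LWLock *lock)
{
    SpinDelayStatus delayStatus;
    bool        delay_inited = false;
    uint32      old_state = pg_atomic_read_u32(&lock->state);

    while (true)
    {
        if ((old_state & LW_FLAG_LOCKED) == 0)
        {
            if (pg_atomic_compare_exchange_u32(&lock->state, &old_state,
                                               old_state | LW_FLAG_LOCKED))
                break;          /* got it without touching delayStatus */
            /* failed CAS already refreshed old_state, just retry */
        }
        else
        {
            /* pay for the spin-delay bookkeeping only once contended */
            if (!delay_inited)
            {
                init_local_spin_delay(&delayStatus);
                delay_inited = true;
            }
            perform_spin_delay(&delayStatus);
            old_state = pg_atomic_read_u32(&lock->state);
        }
    }

    if (delay_inited)
        finish_spin_delay(&delayStatus);
}

That keeps the uncontended path to a single read plus one CAS, at the
cost of one extra local flag.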

> + /*
> + * Read once outside the loop. Later iterations will get the newer value
> + * either via compare & exchange or with read after perform_spin_delay.
> + */
> + old_state = pg_atomic_read_u32(&lock->state);
> + /* loop until we've determined whether we could acquire the lock or not */
> + while (true)
> + {
> + uint32 desired_state;
> +
> + desired_state = old_state;
> +
> + lock_free = (old_state & mask) == 0;
> + if (lock_free)
> {
> - if (lock_free)
> + desired_state += add;
> + if (pg_atomic_compare_exchange_u32(&lock->state,
> + &old_state, desired_state))
> {
> /* Great! Got the lock. */
> #ifdef LOCK_DEBUG
> if (mode == LW_EXCLUSIVE)
> lock->owner = MyProc;
> #endif
> - return false;
> + break;
> + }
> + }
> + else if ((old_state & LW_FLAG_LOCKED) == 0)
> + {
> + desired_state |= LW_FLAG_LOCKED | LW_FLAG_HAS_WAITERS;
> + if (pg_atomic_compare_exchange_u32(&lock->state,
> + &old_state, desired_state))
> + {
> + LWLockQueueSelfLocked(lock, waitmode);
> + break;
> }
> - else
> - return true; /* somebody else has the lock */
> + }
> + else
> + {
> + perform_spin_delay(&delayStatus);
> + old_state = pg_atomic_read_u32(&lock->state);
> }
> }
> - pg_unreachable();
> +#ifdef LWLOCK_STATS
> + add_lwlock_stats_spin_stat(lock, &delayStatus);
> +#endif
> +
> + /*
> + * We intentionally do not call finish_spin_delay here, because the loop
> + * above usually finishes by queuing onto the wait list on contention, so
> + * it doesn't reach spins_per_delay and therefore doesn't sleep inside
> + * perform_spin_delay. Also, different LWLocks have very different
> + * contention patterns, and it is wrong to update the spinlock statistics
> + * based on LWLock contention.
> + */

Huh? This seems entirely unconvincing. Without adjusting this here we'll
just spin the same way every iteration. Especially for the case where
somebody else holds LW_FLAG_LOCKED, that's quite bad.
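
For comparison, the existing code closes its spin loops with
finish_spin_delay(), and that call is exactly what keeps spins_per_delay
adaptive: it's nudged up when the caller never had to sleep and down
when it did (see s_lock.c). If the concern is just the zero-spin passes,
a small helper along these lines (sketch only, assuming the stock
SpinDelayStatus fields from s_lock.h) would keep the feedback without
skewing it:

static inline void
maybe_finish_spin_delay(SpinDelayStatus *status)
{
    /* only feed the adaptation if we actually iterated */
    if (status->spins > 0 || status->delays > 0)
        finish_spin_delay(status);
}

LWLockAttemptLockOrQueueSelf could call that right next to the
LWLOCK_STATS accounting at the end of the loop.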

> From e5b13550fc48d62b0b855bedd7fcd5848b806b09 Mon Sep 17 00:00:00 2001
> From: Sokolov Yura <funny(dot)falcon(at)postgrespro(dot)ru>
> Date: Tue, 30 May 2017 18:54:25 +0300
> Subject: [PATCH 5/6] Fix implementation description in lwlock.c.
>
> ---
> src/backend/storage/lmgr/lwlock.c | 17 ++++++++---------
> 1 file changed, 8 insertions(+), 9 deletions(-)
>
> diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
> index 334c2a2d96..0a41c2c4e2 100644
> --- a/src/backend/storage/lmgr/lwlock.c
> +++ b/src/backend/storage/lmgr/lwlock.c
> @@ -62,16 +62,15 @@
> * notice that we have to wait. Unfortunately by the time we have finished
> * queuing, the former locker very well might have already finished it's
> * work. That's problematic because we're now stuck waiting inside the OS.
> -
> - * To mitigate those races we use a two phased attempt at locking:
> - * Phase 1: Try to do it atomically, if we succeed, nice
> - * Phase 2: Add ourselves to the waitqueue of the lock
> - * Phase 3: Try to grab the lock again, if we succeed, remove ourselves from
> - * the queue
> - * Phase 4: Sleep till wake-up, goto Phase 1
> *
> - * This protects us against the problem from above as nobody can release too
> - * quick, before we're queued, since after Phase 2 we're already queued.
> + * This race is avoided by taking the wait list lock with a CAS against the old
> + * state value, so it only succeeds if the lock is still held. The necessary flag
> + * is set by the same CAS.
> + *
> + * Inside LWLockConflictsWithVar the wait list lock is reused to protect the
> + * variable's value. First the lock is acquired to check the variable's value,
> + * then flags are set with a second CAS before queuing onto the wait list, to
> + * ensure that the lock has not been released yet.
> * -------------------------------------------------------------------------
> */

I think this needs more extensive surgery.

> From cc74a849a64e331930a2285e15445d7f08b54169 Mon Sep 17 00:00:00 2001
> From: Sokolov Yura <funny(dot)falcon(at)postgrespro(dot)ru>
> Date: Fri, 2 Jun 2017 11:34:23 +0000
> Subject: [PATCH 6/6] Make SpinDelayStatus a bit lighter.
>
> It saves a couple of instructions on the fast path of the spin loop, and
> so makes the LWLock fast path a bit faster (as well as other places where
> spinlocks are used).

> Also, it makes the call to perform_spin_delay a bit slower, which could
> positively affect spin behavior, especially if no `pause` instruction is
> present.

Whaa? That seems pretty absurd reasoning.

One big advantage of this is that we should be able to make LW_SHARED
acquisition an xadd. I'd done that previously, when the separate
spinlock still existed, and it was quite beneficial for performance on
larger systems. But it was some fiddly code - should be easier now.
It requires some careful management when noticing that an xadd acquired a
shared lock even though there's a conflicting exclusive lock, but it
speeds up the fast path.
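
Roughly the shape I mean - untested sketch only, using lwlock.c's
LW_VAL_SHARED/LW_VAL_EXCLUSIVE and a made-up function name; the fiddly
part is the backout when the xadd raced with an exclusive holder:

static bool
LWLockAttemptSharedXadd(LWLock *lock)
{
    uint32      old_state;

    /* optimistically add a shared reference with a single atomic add */
    old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);

    if ((old_state & LW_VAL_EXCLUSIVE) == 0)
        return true;            /* no exclusive holder: we own a share */

    /*
     * An exclusive holder was present, so our reference must not count.
     * Back it out; anyone inspecting the shared count has to tolerate
     * such transient over-counts.
     */
    pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
    return false;               /* caller falls back to queue + wait */
}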

Greetings,

Andres Freund
