From: Sokolov Yura <funny(dot)falcon(at)postgrespro(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: jesper(dot)pedersen(at)redhat(dot)com, pgsql-hackers(at)postgresql(dot)org, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Fix performance degradation of contended LWLock on NUMA
Date: 2017-10-20 08:54:04
Message-ID: cace5dc9bfec2ae206db56968fe3575e@postgrespro.ru
Lists: pgsql-hackers

Hello,

On 2017-10-19 19:46, Andres Freund wrote:
> On 2017-10-19 14:36:56 +0300, Sokolov Yura wrote:
>> > > + init_local_spin_delay(&delayStatus);
>> >
>> > The way you moved this around has the disadvantage that we now do this -
>> > a number of writes - even in the very common case where the lwlock can
>> > be acquired directly.
>>
>> Excuse me, I don't quite understand.
>> Is your objection to init_local_spin_delay being placed here?
>
> Yes.

I could place it before perform_spin_delay under `if (!spin_inited)`, if
you think that is absolutely necessary.
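
Roughly what I mean, as a sketch only (the surrounding function and the
MY_FLAG_LOCKED bit are made up for illustration; this is not the patch
itself):

#include "postgres.h"
#include "port/atomics.h"
#include "storage/s_lock.h"

#define MY_FLAG_LOCKED ((uint32) 1 << 28)    /* made-up flag bit */

static void
acquire_state_lock(pg_atomic_uint32 *state)
{
    SpinDelayStatus delayStatus;
    bool        spin_inited = false;

    for (;;)
    {
        uint32      old_state = pg_atomic_fetch_or_u32(state, MY_FLAG_LOCKED);

        if ((old_state & MY_FLAG_LOCKED) == 0)
            break;              /* got the flag */

        /* Pay the initialization writes only once we actually contend. */
        if (!spin_inited)
        {
            init_local_spin_delay(&delayStatus);
            spin_inited = true;
        }
        perform_spin_delay(&delayStatus);
    }

    /*
     * Whether finish_spin_delay() should be called at all is exactly the
     * point debated further down in this mail; calling it only when we
     * actually spun keeps the fast path free of it.
     */
    if (spin_inited)
        finish_spin_delay(&delayStatus);
}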

>
>
>> Placing it anywhere else would complicate the code.
>
>
>> Or do you object to setting `mask` and `add`?
>
> That seems right.
>
>
>> In both cases, I think the simpler version should be accepted first. It
>> acts as an algorithm definition, and it already gives a measurable
>> improvement.
>
> Well, in scalability. I'm less sure about uncontended performance.
>
>
>
>> > > + * We intentionally do not call finish_spin_delay here, because the loop
>> > > + * above usually finished by queuing into the wait list on contention, and
>> > > + * doesn't reach spins_per_delay thereby doesn't sleep inside of
>> > > + * perform_spin_delay. Also, different LWLocks has very different
>> > > + * contention pattern, and it is wrong to update spin-lock statistic based
>> > > + * on LWLock contention.
>> > > + */
>> >
>> > Huh? This seems entirely unconvincing. Without adjusting this here we'll
>> > just spin the same way every iteration. Especially for the case where
>> > somebody else holds LW_FLAG_LOCKED that's quite bad.
>>
>> LWLocks are very different. Some of them are always short-term
>> (BufferLock), others are always locked for a long time.
>
> That seems not particularly relevant. The same is true for
> spinlocks. The relevant question isn't how long the lwlock is held, it's
> how long LW_FLAG_LOCKED is held - and that should only depend on
> contention (i.e. bus speed, amount of times put into sleep while holding
> lock, etc), not on how long the lock is held.
>
>> I've tried to place this delay into lock itself (it has 2 free bytes),
>> but this attempt performed worse.
>
> That seems unsurprising - there's a *lot* of locks, and we'd have to
> tune all of them. Additionally there's a bunch of platforms where we do
> *not* have free bytes (consider the embedding in BufferTag).
>
>
>> Now I understand that the delays should be stored in an array indexed by
>> tranche. But I have no time to test this idea, and I doubt it will give
>> dramatically better results (i.e. > 5%), so I think it is better to accept
>> the patch in this form, and then experiment with per-tranche delay.
>
> I don't think tranches have any decent predictive power here.

Have a look at "Make acquiring LWLock to look more like spinlock":
for the first `skip_wait_list` iterations there is no attempt to queue
ourselves into the wait list. That gives most of the improvement.
(Without it there is simply no degradation at a high number of clients,
but there is a small decline at low client counts, because the current
algorithm is also a little "spinny": it attempts to acquire the lock
again after queueing itself into the wait list.)
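
Roughly, the acquisition loop behaves like the following sketch (names
such as MY_FLAG_LOCKED, SKIP_WAIT_LIST and queue_self_and_sleep are
made-up placeholders for illustration, not PostgreSQL functions and not
the actual patch):

#include "postgres.h"
#include "port/atomics.h"
#include "storage/s_lock.h"

#define MY_FLAG_LOCKED  ((uint32) 1 << 28)   /* made-up flag bit */
#define SKIP_WAIT_LIST  8                    /* made-up constant, would be tuned */

/* Placeholder for the real "queue self onto the wait list and sleep" path. */
static void
queue_self_and_sleep(pg_atomic_uint32 *state)
{
    (void) state;               /* the real code would enqueue on the lock */
    pg_usleep(1000L);
}

static void
acquire_with_skip_wait_list(pg_atomic_uint32 *state)
{
    SpinDelayStatus delayStatus;
    int         attempt = 0;

    init_local_spin_delay(&delayStatus);

    for (;;)
    {
        uint32      old_state = pg_atomic_fetch_or_u32(state, MY_FLAG_LOCKED);

        if ((old_state & MY_FLAG_LOCKED) == 0)
            break;              /* acquired */

        if (attempt++ < SKIP_WAIT_LIST)
            perform_spin_delay(&delayStatus);   /* behave like a spinlock */
        else
            queue_self_and_sleep(state);        /* fall back to the wait list */
    }
}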

`skip_wait_list` depends heavily on the tranche. A per-tranche
`skip_wait_list` should be calculated based on the ability to acquire the
lock without any sleep, i.e. without either queueing ourselves into the
wait list or falling into `pg_usleep` inside `perform_spin_delay`. I
suppose it is obvious that `spins_per_delay` has to be proportional to
`skip_wait_list` (it should be at least greater than `skip_wait_list`).
Therefore `spins_per_delay` should also depend at run time on the lock's
tranche.

Am I wrong?
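
Purely as an illustration of what I mean by per-tranche auto-tuning
(none of these names, bounds or constants exist anywhere, it is just a
sketch):

#include "postgres.h"

#define MAX_TRANCHES 256                /* made-up upper bound */

typedef struct TrancheDelayTuning
{
    int         skip_wait_list;         /* attempts before queueing into wait list */
    int         spins_per_delay;        /* kept at least as large as skip_wait_list */
} TrancheDelayTuning;

static TrancheDelayTuning tranche_tuning[MAX_TRANCHES];

/* Feed back whether the lock was taken without any sleep at all. */
static void
tranche_tuning_update(int tranche, bool acquired_without_sleep)
{
    TrancheDelayTuning *t = &tranche_tuning[tranche];

    if (acquired_without_sleep)
        t->skip_wait_list = Min(t->skip_wait_list + 1, 1000);
    else
        t->skip_wait_list = Max(t->skip_wait_list - 1, 1);

    /* spins_per_delay must stay proportional to (at least) skip_wait_list. */
    t->spins_per_delay = Max(t->spins_per_delay, t->skip_wait_list);
}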

Without that per-tranche auto-tuning, I think it is better not to touch
`spins_per_delay` inside `LWLockAttemptLockOrQueue`.

With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
