Re: Better LWLocks with compare-and-swap (9.4)

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Better LWLocks with compare-and-swap (9.4)
Date: 2013-05-20 21:20:10
Message-ID: 519A938A.1070903@vmware.com
Lists: pgsql-hackers

On 16.05.2013 01:08, Daniel Farina wrote:
> On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
>> I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux machine:
>>
>> -  64.09%  postgres  postgres            [.] tas
>>    - tas
>>       - 99.83% s_lock
>>          - 53.22% LWLockAcquire
>>             + 99.87% GetSnapshotData
>>          - 46.78% LWLockRelease
>>               GetSnapshotData
>>             + GetTransactionSnapshot
>> +   2.97%  postgres  postgres            [.] tas
>> +   1.53%  postgres  libc-2.13.so        [.] 0x119873
>> +   1.44%  postgres  postgres            [.] GetSnapshotData
>> +   1.29%  postgres  [kernel.kallsyms]   [k] arch_local_irq_enable
>> +   1.18%  postgres  postgres            [.] AllocSetAlloc
>> ...
>>
>> So, on this test, a lot of time is wasted spinning on the mutex of
>> ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
>> surprisingly steep drop in performance once you go beyond 29 clients
>> (attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
>> is that after that point all the cores are busy, and processes start to be
>> sometimes context switched while holding the spinlock, which kills
>> performance.
>
> I have, I also used linux perf to come to this conclusion, and my
> determination was similar: a system was undergoing increasingly heavy
> load, in this case with processes >> number of processors. It was
> also a phase-change type of event: at one moment everything would be
> going great, but once a critical threshold was hit, s_lock would
> consume an enormous amount of CPU time. I figured preemption while in
> the spinlock was to blame at the time, given the extreme nature

Stop the press! I'm getting the same speedup on that 32-core box I got
with the compare-and-swap patch, from this one-liner:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -200,6 +200,8 @@ typedef unsigned char slock_t;

#define TAS(lock) tas(lock)

+#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))
+
static __inline__ int
tas(volatile slock_t *lock)
{
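
To make it concrete, here is a minimal, self-contained sketch of how a
spin loop can use a macro like that. This is not the actual s_lock.h/s_lock.c
code: spin_acquire/spin_release are made-up names, and the GCC __sync
builtins stand in for the inline-assembly xchg.

typedef unsigned char slock_t;

static inline int
tas(volatile slock_t *lock)
{
	/* atomically set *lock to 1 and return the previous value (0 = got it) */
	return (int) __sync_lock_test_and_set(lock, 1);
}

#define TAS(lock)		tas(lock)
/* test-and-test-and-set: only issue the atomic op if the lock looks free */
#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))

static void
spin_acquire(volatile slock_t *lock)
{
	if (TAS(lock) == 0)
		return;					/* uncontended fast path */

	/*
	 * Contended: the plain read in TAS_SPIN() lets all waiters spin on a
	 * shared cache line; the locked exchange is only attempted once the
	 * lock looks free.
	 */
	while (TAS_SPIN(lock) != 0)
		;						/* the real loop adds delays and backoff here */
}

static void
spin_release(volatile slock_t *lock)
{
	__sync_lock_release(lock);	/* store 0 with release semantics */
}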

So, on this system, doing a non-locked test before the locked xchg
instruction while spinning is a very good idea. That contradicts the
testing that was done earlier when the x86-64 implementation was added,
as we have this comment in the tas() implementation:

> /*
> * On Opteron, using a non-locking test before the locking instruction
> * is a huge loss. On EM64T, it appears to be a wash or small loss,
> * so we needn't bother to try to distinguish the sub-architectures.
> */

On my test system, the non-locking test is a big win. I tested this
because I was reading this article from Intel:

http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/.
It says very explicitly that the non-locking test is a good idea:

> Spinning on volatile read vs. spin on lock attempt
>
> One common mistake made by developers developing their own spin-wait loops is attempting to spin on an atomic instruction instead of spinning on a volatile read. Spinning on a dirty read instead of attempting to acquire a lock consumes less time and resources. This allows an application to attempt to acquire a lock only when it is free.
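
In portable C11 atomics, just to illustrate the article's point (this is not
PostgreSQL code and the function names are made up), the two strategies look
like this:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool lock = false;

/*
 * The mistake the article warns about: every iteration is an atomic
 * read-modify-write, so each waiter keeps pulling the cache line into
 * exclusive state even while the lock is held by someone else.
 */
static void
acquire_spin_on_atomic(void)
{
	while (atomic_exchange_explicit(&lock, true, memory_order_acquire))
		;
}

/*
 * The recommended pattern, which is what the TAS_SPIN() one-liner above
 * amounts to: spin on a plain load, and only attempt the exchange once
 * the lock looks free.
 */
static void
acquire_spin_on_read(void)
{
	for (;;)
	{
		while (atomic_load_explicit(&lock, memory_order_relaxed))
			;					/* waiters can share the cache line */
		if (!atomic_exchange_explicit(&lock, true, memory_order_acquire))
			return;
	}
}

static void
release_lock(void)
{
	atomic_store_explicit(&lock, false, memory_order_release);
}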

Now, I'm not sure what to do about this. If we put the non-locking test
in there, then according to the previous testing it would be a huge loss
on Opterons.

Perhaps we should just sleep earlier, i.e. lower MAX_SPINS_PER_DELAY.
That way, even if each TAS_SPIN test is very expensive, we don't spend
too much time spinning if the lock is heavily contended, or held by a
process that is sleeping.
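
For illustration, here is a simplified, hypothetical sketch of that
spin-then-sleep shape. The constants, the doubling backoff, and the
function name are made up; the real s_lock() has its own delay, backoff
and timeout logic.

#include <unistd.h>

typedef unsigned char slock_t;

#define TAS(lock)		((int) __sync_lock_test_and_set((lock), 1))
#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))

#define SPINS_BEFORE_SLEEP	100			/* stand-in for the spin cap */
#define MIN_DELAY_USEC		1000		/* 1 ms initial sleep */
#define MAX_DELAY_USEC		500000		/* don't back off past 0.5 s */

static void
spin_until_acquired(volatile slock_t *lock)
{
	int			spins = 0;
	unsigned int cur_delay = MIN_DELAY_USEC;

	while (TAS_SPIN(lock))
	{
		if (++spins < SPINS_BEFORE_SLEEP)
			continue;			/* keep spinning cheaply for a while */

		/*
		 * Lowering the spin cap means we reach this point sooner: the
		 * holder is probably not running, so stop burning CPU, sleep,
		 * and back off so the waiters don't all wake up in lockstep.
		 */
		usleep(cur_delay);
		cur_delay *= 2;
		if (cur_delay > MAX_DELAY_USEC)
			cur_delay = MAX_DELAY_USEC;
		spins = 0;
	}
}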

- Heikki
