Re: spinlock contention

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: spinlock contention
Date: 2011-06-28 18:33:57
Message-ID: C788168C-E1E4-4186-91AC-62CC3CA3BE0A@phlo.org

On Jun23, 2011, at 23:40 , Robert Haas wrote:
>>> I tried rewriting the LWLocks using CAS. It actually seems to make
>>> things slightly worse on the tests I've done so far, perhaps because I
>>> didn't make it respect spins_per_delay. Perhaps fetch-and-add would
>>> be better, but I'm not holding my breath. Everything I'm seeing so
>>> far leads me to the belief that we need to get rid of the contention
>>> altogether, not just contend more quickly.

I got curious and put a tool together to benchmark this. The code can
be found at https://github.com/fgp/lockbench.

The tool starts N workers who each acquire a lock in SHARE mode, increment
a per-worker counter and release the lock again, rinse, repeat. That is
done for T seconds, after which it reports the sum of the individual
counters, the elapsed (wall) time and the sums of the user and system times
consumed by the workers. The lock implementations currently supported are

atomicinc: RW-Lock using atomic increment/decrement instructions
(locked xaddl) to acquire and release the lock in SHARE
mode in the non-contested case. The contested case is
unimplemented. (A sketch of this fast path, together with the
benchmark's worker loop, follows the list below.)

cmpxchng: Like atomicinc, but uses "locked cmpxchng" loop instead of
"locked xaddl".

spin: RW-Lock built around a simple cmpxchng-based spinlock (i.e.,
to lock or unlock, the spinlock is taken, the shared-lockers count
is incremented or decremented, and the spinlock is released).
Similar to pg_lwlock, but without the sleep() in the contested
path of the spinlock. The contested case of the RW-Lock is again
unimplemented.

none: No locking.

pg_lwlock: Postgres LWLock implementation. The contested case doesn't
work because the workers currently don't have a valid MyProc.

pg_lwlock_cas: Like pg_lwlock, but with your CAS patch applied.
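
To make that concrete, here is a rough sketch in C of what atomicinc's
SHARE-mode fast path and the per-worker benchmark loop boil down to. This is
not the actual lockbench code; the names (rwlock, rwlock_acquire_shared,
worker) are invented for illustration, it assumes GCC's __sync builtins on
x86/x86_64, and the contested path is left out just as in the benchmark:

    #include <stdint.h>

    typedef struct
    {
        volatile int32_t state;      /* SHARE-lock count; exclusive part omitted */
    } rwlock;

    static int
    rwlock_acquire_shared(rwlock *lk)
    {
        /* fast path: a single atomic increment ("locked xaddl" on x86) */
        int32_t prev = __sync_fetch_and_add(&lk->state, 1);

        /* a negative previous value would mean an exclusive holder; that
         * contested case is left unimplemented, as in the benchmark */
        return prev >= 0;
    }

    static void
    rwlock_release_shared(rwlock *lk)
    {
        /* fast path: a single atomic decrement */
        __sync_fetch_and_sub(&lk->state, 1);
    }

    /* what each worker repeats until the T seconds are up */
    static void
    worker(rwlock *lk, uint64_t *my_counter, volatile int *stop)
    {
        while (!*stop)
        {
            rwlock_acquire_shared(lk);
            (*my_counter)++;         /* the "work": bump the per-worker counter */
            rwlock_release_shared(lk);
        }
    }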

On a 4-core Intel Xeon system (Intel Xeon X3430, 2.4 GHz, 8MB cache) I get
the following results:

         |   1 worker    |   2 workers   |   3 workers   |   4 workers   |
         |wallsec usersec|wallsec usersec|wallsec usersec|wallsec usersec|
         |   /       /   |   /       /   |   /       /   |   /       /   |
         |cycles  cycles |cycles  cycles |cycles  cycles |cycles  cycles |
--------------------------------------------------------------------------
none     |2.5e-09 2.5e-09|1.3e-09 2.5e-09|8.5e-10 2.5e-09|6.8e-10 2.5e-09|
atomicinc|1.8e-08 1.8e-08|2.8e-08 5.6e-08|2.7e-08 8.1e-08|2.9e-08 1.1e-07|
cmpxchng |3.0e-08 3.0e-08|7.1e-08 1.4e-07|6.9e-08 2.0e-07|7.2e-08 2.9e-07|
spin     |5.0e-08 5.0e-08|3.0e-07 5.9e-07|3.8e-07 1.1e-06|4.0e-07 1.5e-06|
pg_lwlock|2.8e-08 2.8e-08|2.9e-08 3.0e-08|3.0e-08 3.2e-08|2.9e-08 3.1e-08|
pg_lwlock|2.6e-08 2.6e-08|6.2e-08 1.2e-07|7.7e-08 2.3e-07|7.8e-08 3.1e-07|
     _cas|               |               |               |               |
--------------------------------------------------------------------------

These results allow the following conclusions to be drawn:

First, our current pg_lwlock is quite good at preserving resources, i.e.
at not spinning excessively - at least for up to four workers. It's the only
lock implementation that consistently uses about 30ns of processor time for
one acquire/release cycle. Only atomicinc with one worker (i.e. no cache line
fighting) manages to beat that. It does, however, effectively serialize
execution with this workload - the consumed user time is approximately equal
to the elapsed wall clock time, even in the case of 4 workers, meaning that
most of the time 3 of those 4 workers are sleeping.

Secondly, pg_lwlock_cas and cmpxchng perform similarly, which comes as no
surprise, since they use effectively the same algorithm. One could expect
spin and pg_lwlock to also perform similarly, but these two don't. The reason
is probably that spin uses a rather brain-dead spinning method, while
pg_lwlock uses "rep; nop" in its delay loop, as sketched below.
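
To illustrate the difference, here is a minimal sketch (invented function
names, GCC builtins and x86 inline assembly assumed; this is neither
lockbench's nor PostgreSQL's actual code) of a tight cmpxchng retry loop
versus one that re-reads the lock word and executes "rep; nop" between
attempts:

    /* "brain-dead" spinning: retry the locked cmpxchg in a tight loop,
     * bouncing the cache line around on every attempt */
    static void
    naive_spin_lock(volatile int *l)
    {
        while (!__sync_bool_compare_and_swap(l, 0, 1))
            ;                        /* immediately try again */
    }

    /* friendlier spinning: only retry the locked instruction once a plain
     * read sees the lock free, executing PAUSE ("rep; nop") while waiting */
    static void
    pause_spin_lock(volatile int *l)
    {
        while (!__sync_bool_compare_and_swap(l, 0, 1))
        {
            while (*l != 0)
                __asm__ __volatile__("rep; nop" ::: "memory");
        }
    }

    static void
    spin_unlock(volatile int *l)
    {
        __sync_lock_release(l);      /* store 0 with release semantics */
    }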

Now to atomicinc, which is (the fast-path of) the most optimized RW-lock
possible, at least on x86 and x86_64. One interesting result here is that
the wall time seconds per cycle are suspiciously similar for atomicinc and
pg_lwlock. This proves, I think, that the limiting factor in both of these
tests is the speed at which cache line invalidation can occur. It doesn't
seem to matter whether the other cores are idle (as in the pg_lwlock case),
or trying to obtain ownership of the cache line themselves (as in the
atomicinc case).

Finally, the no-locking test (none) shows that, even though the counter
is written to shared memory after every increment, memory bandwidth isn't
an issue here, since performance scales nearly linearly with the number
of workers.

Here's the result of another run, this time with the per-worker counter
being incremented 100 times in each acquire/release cycle.

------------------- 100 Increments per Cycle -----------------------------
         |   1 worker    |   2 workers   |   3 workers   |   4 workers   |
         |wallsec usersec|wallsec usersec|wallsec usersec|wallsec usersec|
         |   /       /   |   /       /   |   /       /   |   /       /   |
         |cycles  cycles |cycles  cycles |cycles  cycles |cycles  cycles |
--------------------------------------------------------------------------
none     |2.5e-07 2.5e-07|1.3e-07 2.5e-07|8.4e-08 2.5e-07|6.5e-08 2.6e-07|
atomicinc|2.5e-07 2.5e-07|1.3e-07 2.6e-07|8.6e-08 2.5e-07|6.5e-08 2.6e-07|
cmpxchng |2.5e-07 2.5e-07|1.3e-07 2.5e-07|8.6e-08 2.6e-07|7.8e-08 3.1e-07|
spin     |3.0e-07 3.0e-07|1.5e-07 3.0e-07|1.1e-07 3.2e-07|3.4e-07 1.4e-06|
pg_lwlock|2.5e-07 2.5e-07|1.3e-07 2.5e-07|8.4e-08 2.5e-07|1.2e-07 4.8e-07|
pg_lwlock|2.6e-07 2.6e-07|1.6e-07 3.1e-07|1.1e-07 3.2e-07|8.7e-08 3.4e-07|
     _cas|               |               |               |               |
--------------------------------------------------------------------------

This makes the overhead of the locking disappear into the noise in the
single-worker case. With 4 workers, however, differences materialize.
The no-locking case still shows approximately linear speedup though,
which I think again rules out memory bandwidth effects.

Now, atomicinc is the only implementation which doesn't perform
significantly worse than the no-lock case. The closest competitors are
the CAS-based implementations cmpxchng and pg_lwlock_cas, with 20% and
33% slowdown respectively (in both wall seconds and user seconds per cycle).

The current LWLock implementation falls behind now, taking about twice as
many seconds per cycle compared to atomicinc.
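
(Working from the 4-worker column: 7.8e-08 / 6.5e-08 ~ 1.2 for cmpxchng,
8.7e-08 / 6.5e-08 ~ 1.33 for pg_lwlock_cas, and 1.2e-07 / 6.5e-08 ~ 1.8 for
pg_lwlock, each relative to the no-lock wall seconds per cycle.)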

In other words, with this workload an atomicinc-based RW-lock taken in
SHARE mode should add virtually no overhead, while every other
implementation adds at least 20%.

It'd be interesting to also get benchmark results on machines with more
than four cores, but I don't have access to any such machine ATM. So
if someone could run this on an 8- or 16-core beast, that'd be great. The
code compiles on Mac OS X and Ubuntu 10.04. It's fairly portable; to
make the postgres parts compile on other platforms, it might be necessary
to adjust pg_stub/pg_config_manual.h.

best regards,
Florian Pflug
