The test case I just posted shows that our spinlock code, which we had
thought largely done, is once again becoming a performance bottleneck.
It's time to resurrect some of the ideas we kicked around in early
2002, and didn't pursue because we decided spinlocks weren't our major
I started investigating this by trying to understand why the "futex
spinlock" patch had helped the Red Hat customer I mentioned, when it
hadn't done anything much in last year's experiments. It turns out
that this customer was using an 8-way Opteron, and the futex patch
helps a lot more on that architecture than others. Some random
experimentation eventually turned up part of the reason: the x86 TAS
assembly code we are using is pessimal for x86_64. Specifically,
s_lock.h uses this for both architectures:
/* Use a non-locking test before asserting the bus lock */
" cmpb $0,%1 \n"
" jne 1f \n"
" lock \n"
" xchgb %0,%1 \n"
and it turns out that deleting the cmpb and jne makes things go
significantly faster on Opterons. So the "non-locking test" is
not a win at all on x86_64. As best I can tell from testing,
removing it wins because it causes the rate of spinlock delays
to drop significantly. Why this should be is not completely
obvious. I speculate that the Opterons are capable of pulling
in a copy of the cache line that contains the spinlock and then
running 100 iterations of the cmpb test without ever noticing
that the processor holding the lock has now released it. Without
the cmpb, they are forced by the locking xchgb test to actually
look at the real state of the spinlock each time.
I kinda suspect that the cmpb test is a no-op or loss on all
Intelish processors: it can only be a win if you expect a lot
of contention for the spin lock, but in percentage terms we still
have a very low conflict rate, so in most executions of the TAS
macro, the cmpb is just wasted cycles. Bottom line: we definitely
don't want it for x86_64, and maybe not at all, but we need more
research to decide the latter.
The second reason that the futex patch is helping is that when
a spinlock delay does occur, it allows the delaying process to be
awoken almost immediately, rather than delaying 10 msec or more
as the existing code does. However, given that we are only expecting
the spinlock to be held for a couple dozen instructions, using the
kernel futex mechanism is huge overkill --- the in-kernel overhead
to manage the futex state is almost certainly several orders of
magnitude more than the delay we actually want.
I looked into several other methods of doing the spinlock delay
instead. I think all of these were suggested at one point or
another in our earlier discussions of spinlocks:
1. Use sched_yield() if available: it does just what we want,
ie, yield the processor without forcing any useless time delay
before we can be rescheduled. This doesn't exist everywhere
but it exists in recent Linuxen, so I tried it. It made for a
beautiful improvement in my test case numbers: CPU utilization
went to 100% and the context swap rate to almost nil. Unfortunately,
I also saw fairly frequent "stuck spinlock" panics when running
more queries than there were processors --- this despite increasing
NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield
anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect.
(I speculate that it's set up to only yield the processor to other
processes already affiliated to that processor. In any case, it
is definitely capable of getting through 10000 yields without
running the guy who's holding the spinlock.)
2. Reduce the length of the select() delays in s_lock. The current code
delays in quanta of 10msec, but on recent Linuxen (and I think other
platforms too) the scheduling quantum is 1msec, so we can get the
processor back faster if we ask for a 1msec delay. I tried this and it
is definitely a win on Linux; it doesn't seem to hurt anything on older
systems either, they just round the delay up to 10msec like before.
So I think we should do this, even though it's only a partial answer.
3. Modify the spin loop to do a little more work between TAS() tests.
In the existing code there are only about half a dozen instructions
total in the normal spin loop, most of them very fast, with the result
that the spinning processor does its best to monopolize the system bus
with locked probe instructions. This is obviously not good, as it'll
interfere with the ability of the spinlock holder to complete its work
and release the lock. (The bulk of the spinlock uses are for LWLocks,
and with the current data structures the LWLock's own state is usually
going to be in the same cache line as the spinlock proper, so that
cache grabs for the spinlock hurt the LWLock state updating as well
as the eventual spinlock release.) I modified the code to use an
integer modulo operation as part of the test for SPINS_PER_DELAY;
on most processors this should keep the processor's attention and
keep it away from the system bus for several times longer than an
average instruction. (It's relevant here that the "rep;nop" used
to slow the spin loop on recent Intel processors appears to be
completely ignored by Opterons. Basically we need a non-hardware-
specific trick to slow the loop.)
4. Modify SPINS_PER_DELAY. Looping more times instead of entering the
kernel is a win for SMP boxes: whenever you give up and call the kernel
to delay, it's likely that the spinlock holder will have released the
lock before you even get all the way into the kernel, let alone get
rescheduled again. I found that pushing SPINS_PER_DELAY above 200
caused the select delay rate to fall to almost nil in this test case.
However, that would be counterproductive for uniprocessors where the
optimal strategy is normally to give up the processor immediately.
We've talked about making the spin count user configurable, but I never
cared for that answer. Instead, I've worked up some code to
automatically adjust the spin count depending on whether we seem to be
in a uniprocessor or multiprocessor. The idea is to decrease the spin
count (down to a lower limit) whenever s_lock has to delay in order to
get a lock, and to increase the count (up to a limit) whenever we see
the lock come free before we've had to delay. The former condition
suggests that we are in a uniprocessor and the latter suggests we are in
a multiprocessor. I recall having complained that this idea would
probably be unstable, but in testing it seems to do pretty nicely at
converging to the minimum spin count on a 1-CPU machine and the maximum
limit on an SMP box. It could do with more testing though to see if it
works the same for other people.
My proposal therefore is to do #2, #3, and #4, and to modify the TAS
assembly code at least on x86_64. Together, these changes bring
my test case on a 4-way Opteron from
N, runtime: 1 36s 2 61s 4 105s 8 198s
N, runtime: 1 35s 2 48s 4 58s 8 105s
At 4 queries, CPU utilization is 100% and the context swap rate nearly
nil, which seems to mean we've gained another (temporary no doubt)
victory over the spinlock problem.
I attach two proposed patches: the first removes the cmpb/jne from
the x86 TAS assembly code, and the second one does the s_lock changes
enumerated as points #2, #3, #4. The first one in particular needs
more testing to see if it hurts performance on any non-Opteron x86
chips. (If so, we'd just make it conditional to x86_64.)
Comments and testing invited.
regards, tom lane
pgsql-hackers by date
|Next:||From: Kurt Roeckx||Date: 2005-09-11 22:23:46|
|Subject: Re: Spinlocks, yet again: analysis and proposed patches|
|Previous:||From: Tom Lane||Date: 2005-09-11 21:51:09|
|Subject: Spinlocks, yet again: a new test case|