Re: spinlock contention

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: spinlock contention
Date: 2011-06-23 19:32:37
Message-ID: BANLkTi=PKaXX4mn1av60arEWnY=944S8Jg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 23, 2011 at 1:34 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> On Jun23, 2011, at 17:42 , Robert Haas wrote:
>> On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>>> On Jun12, 2011, at 23:39 , Robert Haas wrote:
>>>> So, the majority (60%) of the excess spinning appears to be due to
>>>> SInvalReadLock.  A good chunk are due to ProcArrayLock (25%).
>>>
>>> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
>>> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
>>> so I guess that two adjacent LWLocks currently share one cache line.
>>>
>>> Currently, the ProcArrayLock has index 4 while SInvalReadLock has
>>> index 5, so if I'm not mistaken exactly the two locks where you saw
>>> the largest contention on are on the same cache line...
>>>
>>> Might make sense to try and see if these numbers change if you
>>> either make LWLockPadded 64 bytes or arrange the locks differently...
>>
>> I fooled around with this a while back and saw no benefit.  It's
>> possible a more careful test would turn up something, but I think the
>> only real way forward here is going to be to eliminate some of that
>> locking altogether.
>
> It seems hard to believe that there isn't some effect of two locks
> sharing a cache line. There are architectures (even some of the
> Intel architectures, I believe) where cache lines are 32 bytes, though -
> so measuring this certainly isn't easy. Also, I'm absolutely no
> expert on this, so it might very well be that I'm missing something
> fundamental.
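FWIW, here's roughly what that padding would look like (a minimal
sketch: the 64-byte line size is an assumption, and the LWLock fields
below are a stand-in, not the real layout from lwlock.c):

#define CACHE_LINE_SIZE 64          /* assumed; verify for the target CPU */

typedef struct LWLock               /* stand-in for the struct in lwlock.c */
{
    volatile char mutex;            /* protecting spinlock */
    char        exclusive;          /* # of exclusive holders (0 or 1) */
    int         shared;             /* # of shared holders */
    void       *head;               /* head of wait queue */
    void       *tail;               /* tail of wait queue */
} LWLock;

/* Pad so that two adjacent locks can never share a cache line. */
typedef union LWLockPadded
{
    LWLock      lock;
    char        pad[CACHE_LINE_SIZE];
} LWLockPadded;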
>
>> SInvalReadLock is a good example of a lock which seems barely to be
>> necessary at all.  Except on extraordinarily rare occasions, it is
>> only ever taken in share mode.
>
> This is the case we should optimize for, I think. Currently, there's
> contention even between two SHARE lockers of a LWLock because our
> LWLock implementation uses a spinlock underneath. I've googled around
> a bit, and found this:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git
>
> It's a userspace rwlock implementation based on atomic instructions
> and futexes (for blocking) written by Linus Torvalds as a response
> to the following thread:
>
> http://lwn.net/Articles/284741/
>
> It seems to require only a single atomic increment instruction to
> acquire a share lock in the uncontested case, and a single compare-
> and-exchange instruction to acquire an exclusive lock (again in
> the uncontested case).
>
> The code doesn't seem to have a license attached, and seems to have
> been written over a weekend and never touched since, so we might
> not want to (or be able to) use this directly. But it'd still be
> interesting to see if replacing our LWLocks with this implementation
> affects throughput.
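A toy version of that scheme, to make the fast paths concrete (GCC
atomic builtins, blocking/futex path omitted - this is in the spirit
of that code, not Linus's actual implementation):

#include <stdint.h>

#define RW_WRITER 0x80000000U       /* high bit = exclusive holder */

typedef struct { uint32_t v; } rwlock_t;

static inline int
rw_try_read(rwlock_t *l)
{
    /* one atomic increment in the uncontested case */
    uint32_t old = __sync_fetch_and_add(&l->v, 1);

    if (old & RW_WRITER)
    {
        __sync_fetch_and_sub(&l->v, 1);     /* writer active: back out */
        return 0;                           /* caller would block on a futex */
    }
    return 1;
}

static inline int
rw_try_write(rwlock_t *l)
{
    /* one compare-and-exchange in the uncontested case */
    return __sync_bool_compare_and_swap(&l->v, 0, RW_WRITER);
}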
>
>> Only SICleanupQueue() needs to lock
>> out readers, and in the (probably common) case where all backends are
>> reasonably close to being caught up, it wouldn't even matter if the
>> determination of minMsgNum were a little bit fuzzy.  I've been
>> straining my brain trying to figure out whether there's some way to
>> rewrite this logic so that it can run concurrently with readers; then
>> we could get rid of SInvalReadLock altogether.  I have a feeling if I
>> were 25% smarter I could do it, but I'm not.
>
> This sounds like something which could be done with RCU (Read-Copy-Update)
> in a lockless way. The idea is to copy the data structure (or parts
> of it) before updating it, and to "commit" the change afterwards by
> adjusting a single pointer to point to the new structure (kinda like
> we do atomic commits by an atomic update of one bit in the clog).
>
> The hard part is usually to figure when it's OK to clean up or
> reuse the old version of the data structure - for that, you need to
> know that all previous readers have completed.
>
> Here's a rough description of how that could work for the invalidation
> queue. We'd have two queue structures in shared memory, each containing
> a version number. We'd also have a "current" pointer pointing to the
> most recent one. Each reader would publish the version number of the
> queue he's about to read (the "current" one) in shared memory
> (and issue a write barrier). SICleanupQueue() would check whether the
> non-current queue structure is unused by comparing its version to the
> published versions of all backends (after issuing a read barrier, and
> after taking a lock that prevents concurrent SICleanupQueue runs of
> course). If there still are users of that version, SICleanupQueue()
> out-waits them. Afterwards, it simply copies the current structure
> over the old one, does the necessary cleanups, increments the version
> number to be larger than the one of the "current" structure,
> and *then* updates the "current" pointer.
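Here's a compressed sketch of those steps (all names invented; the
busy-wait stands in for whatever out-waiting mechanism would really
be used):

#include <stdint.h>
#include <string.h>

#define MAX_BACKENDS 128

typedef struct
{
    uint64_t    version;
    /* ... the actual invalidation queue fields ... */
} SISegVersioned;

static SISegVersioned segs[2] = {{1}, {2}};         /* 0 is reserved */
static SISegVersioned * volatile current_seg = &segs[0];
static volatile uint64_t published[MAX_BACKENDS];   /* 0 = not reading */

static void
si_read(int self)
{
    SISegVersioned *seg = current_seg;

    published[self] = seg->version;     /* publish before reading */
    __sync_synchronize();               /* write barrier */
    /* ... read messages from seg ... */
    published[self] = 0;
}

static void
si_cleanup(void)    /* caller holds a lock against concurrent cleanups */
{
    SISegVersioned *old = (current_seg == &segs[0]) ? &segs[1] : &segs[0];
    int         i;

    __sync_synchronize();               /* read barrier */
    for (i = 0; i < MAX_BACKENDS; i++)
        while (published[i] == old->version)
            ;                           /* out-wait remaining readers */

    memcpy(old, current_seg, sizeof(*old));     /* copy current over old */
    /* ... do the necessary cleanups on old ... */
    old->version = current_seg->version + 1;
    current_seg = old;                  /* *then* flip the pointer */
}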
>
>> So my fallback position is to try to find some way to optimize the
>> common case where there are no messages to be read.  We're already
>> allowing readers and writers to run in parallel, so it must not be
>> important for a reader to see a message that a writer is in the
>> process of writing, but has not yet finished writing.  I believe the
>> idea here is that sinval messages should be emitted before releasing
>> the related locks, so if we've successfully acquired a lock on an
>> object, then any pertinent sinval messages must already be completely
>> written.  What I'm thinking we could do is just keep a global counter
>> indicating how many messages have been written, and each backend could
>> keep a counter indicating the highest-numbered message it has so far
>> read.  When a message is written, the global counter is bumped.  When
>> a backend goes to AcceptInvalidationMessages(), it compares the global
>> counter to its own counter, and if they're the same, then it doesn't
>> bother taking SInvalReadLock and just exits quickly.
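In code, that fast path is about this simple (sketch - the counter
names are invented, and the real counters would live in shared memory):

#include <stdint.h>

static volatile uint32_t shared_msg_counter;    /* really in shmem; bumped
                                                 * by message writers */
static uint32_t local_msg_counter;              /* per-backend: highest
                                                 * message number read */

void
AcceptInvalidationMessages_fastpath(void)
{
    if (shared_msg_counter == local_msg_counter)
        return;             /* nothing new: skip SInvalReadLock entirely */

    /* slow path: LWLockAcquire(SInvalReadLock, LW_SHARED), drain the
     * queue, LWLockRelease(), then catch up: */
    local_msg_counter = shared_msg_counter;
}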
>
> Sounds sensible. I do fear, however, that if we start to micro-optimize
> around every single LWLock we're in for quite a maintenance burden in
> the future. Even if each single modification is relatively simple, the
> complexity still adds up. So I still believe that it'd be better to
> start with optimizing LWLocks in general, and to only touch the LWLocks
> which are still hot spots afterwards.
>
>> There is an obvious (potential) memory-ordering problem here.  If it's
>> possible for a backend doing AcceptInvalidationMessages() to see a
>> stale version of the counter, then it might fail to read messages from
>> the queue that it really ought to have seen.  Protecting the global
>> counter with a spinlock defeats the purpose of doing any of this in
>> the first place, because the whole problem is too many people fighting
>> over the spinlock protecting SInvalReadLock.  It should be sufficient,
>> though, for the writing backend to execute a memory-barrier operation
>> just after bumping the global counter and the reading backend to
>> execute one just before.  As it happens, LWLockRelease() is such a
>> barrier, since it must acquire and release a spinlock, so we should be
>> able to just bump the counter right before releasing the LWLock and
>> call it good.  On the reading side, the immediately-preceding lock
>> acquisition seems like it ought to be sufficient - surely it can't be
>> possible to acquire a heavyweight lock on an object without seeing
>> memory updates that some other process made before releasing the same
>> heavyweight lock.
>
> If we start doing lockless algorithms, I really think we should add
> explicit barrier primitives. We could start out with something
> simple such as
>
> #define MemoryBarrierWrite \
>  do { slock_t l; S_UNLOCK(&l); } while (0)
> #define MemoryBarrierRead \
>  do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); } while (0)
> #define MemoryBarrierReadWrite \
>  do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); S_UNLOCK(&l); } while (0)
>
> But yeah, it seems that in the case of the sinval queue, the surrounding
> LWLocks actually provide the necessary barriers.
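Right - so the write side would come out as something like this
(sketch; shared_msg_counter is the invented global counter from above,
while the LWLock calls and SInvalWriteLock are the existing API):

#include <stdint.h>
#include "storage/lwlock.h"

extern volatile uint32_t shared_msg_counter;

void
write_sinval_message(void)
{
    LWLockAcquire(SInvalWriteLock, LW_EXCLUSIVE);
    /* ... append the message(s) to the queue ... */
    shared_msg_counter++;               /* bump while still holding the lock */
    LWLockRelease(SInvalWriteLock);     /* the release doubles as the
                                         * write barrier */
}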
>
>> A possible objection here is that a 32-bit counter might wrap around,
>> giving a false negative, and a read from a 64-bit counter might not be
>> atomic.  In practice, the chances of a problem here seem remote, but I
>> think we can guard against it easily enough.  We can put each
>> backend's counter of messages read into shared memory, protected by a
>> per-backend lock (probably the same LWLock the fast relation lock
>> patch already introduces).  When a backend compares its counter to the
>> global counter, it does so while holding this lock.  When
>> SICleanupQueue() resets a backend, it grabs the lock and sets that
>> backend's counter to a special value (like 0) that we guarantee will
>> never match the global counter (which we can cause to wrap from 2^32-1
>> to 1).  So then AcceptInvalidationMessages() - in the common case
>> where there are no invalidation messages to process - will only need
>> the per-backend LWLock rather than fighting over SInvalLock.
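The guard itself is cheap (sketch; names invented - the point is just
that 0 is reserved for "reset" and the global counter skips it when it
wraps):

#include <stdint.h>

#define COUNTER_RESET 0     /* never a valid global counter value */

static inline uint32_t
bump_msg_counter(volatile uint32_t *counter)
{
    uint32_t next = *counter + 1;

    if (next == COUNTER_RESET)
        next = 1;           /* wrap from 2^32-1 to 1, never to 0 */
    *counter = next;        /* caller holds the lock / issues the barrier */
    return next;
}

/*
 * SICleanupQueue() resetting a lagging backend, while holding that
 * backend's per-backend LWLock:
 *
 *     backend->msgs_read = COUNTER_RESET;
 *
 * After that, the fast-path comparison can never falsely succeed.
 */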
>
> OTOH, don't all architectures that are interesting targets of
> performance tuning provide atomic access to 64-bit values? Even i386
> has an 8-byte compare-and-exchange instruction I think...

Yup, 'lock cmpxchg8b'.
See http://www.niallryan.com/node/137 for a way to do it via the FPU
without a loop. The 'lock' prefix is a bus lock.
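With GCC builtins the non-FPU version is a one-liner (sketch; this
compiles to lock cmpxchg8b on 32-bit x86, Pentium or later):

#include <stdint.h>

/*
 * Atomically read a 64-bit value on 32-bit x86: a compare-and-swap of
 * 0 -> 0 never changes memory but always returns the current contents.
 * Note it still takes the bus lock, so it isn't contention-free.
 */
static inline uint64_t
atomic_read_u64(volatile uint64_t *p)
{
    return __sync_val_compare_and_swap(p, (uint64_t) 0, (uint64_t) 0);
}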

merlin
