From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Subject: Re: spinlock contention
On Thu, Jun 23, 2011 at 11:42 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> On Jun 12, 2011, at 23:39, Robert Haas wrote:
>>> So, the majority (60%) of the excess spinning appears to be due to
>>> SInvalReadLock. A good chunk are due to ProcArrayLock (25%).
>> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
>> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
>> so I guess that two adjacent LWLocks currently share one cache line.
>> Currently, the ProcArrayLock has index 4 while SInvalReadLock has
>> index 5, so if I'm not mistaken, exactly the two locks on which you saw
>> the largest contention are on the same cache line...
>> Might make sense to try and see if these numbers change if you
>> either make LWLockPadded 64 bytes or arrange the locks differently...
> I fooled around with this a while back and saw no benefit. It's
> possible a more careful test would turn up something, but I think the
> only real way forward here is going to be to eliminate some of that
> locking altogether.
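To make the cache-line arithmetic in the quoted suggestion concrete, here is a minimal sketch assuming 64-byte cache lines and a lock array that starts on a line boundary. `DemoLock`, `DemoLockPadded32`, `DemoLockPadded64`, `demo_locks`, and `line_of` are hypothetical names; the struct only mimics the 24-byte size quoted above and is not PostgreSQL's actual LWLock definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on recent x86-64 CPUs. */
#define CACHE_LINE_SIZE 64

/* Hypothetical lock body standing in for LWLock (24 bytes on x86-64);
 * the field list is illustrative only. */
typedef struct DemoLock {
    volatile char mutex;   /* spinlock protecting the fields below */
    char exclusive;        /* number of exclusive holders (0 or 1) */
    int shared;            /* number of shared holders */
    void *head;            /* head of the wait queue */
    void *tail;            /* tail of the wait queue */
} DemoLock;

/* Layout described in the quoted text: padded to 32 bytes, two locks per line. */
typedef union DemoLockPadded32 {
    DemoLock lock;
    char pad[32];
} DemoLockPadded32;

/* Suggested alternative: padded to a full cache line, one lock per line. */
typedef union DemoLockPadded64 {
    DemoLock lock;
    char pad[CACHE_LINE_SIZE];
} DemoLockPadded64;

static_assert(sizeof(DemoLockPadded32) == 32, "two locks per cache line");
static_assert(sizeof(DemoLockPadded64) == CACHE_LINE_SIZE, "one lock per cache line");

/* Align the array so that element 0 starts on a cache-line boundary. */
_Alignas(CACHE_LINE_SIZE) DemoLockPadded64 demo_locks[8];

/* Cache line holding lock `index` for element size `stride`,
 * assuming the array begins on a line boundary. */
size_t line_of(size_t index, size_t stride)
{
    return (index * stride) / CACHE_LINE_SIZE;
}
```

With 32-byte elements, indexes 4 and 5 sit at byte offsets 128 and 160, so `line_of` returns 2 for both; with 64-byte elements they land on lines 4 and 5. Whether separating them actually helps is an empirical question the sketch does not answer.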
I did some benchmarking on the 32-core system from Nate Boley, with
pgbench -n -S -c 80 -j 80. With master+fastlock+lazyvxid, I can hit
180-200k TPS in the 32-client range. But at 80 clients throughput
starts to fall off quite a bit, dropping down to about 80k TPS. So
then, just for giggles, I inserted a "return;" statement at the top of
AcceptInvalidationMessages(), turning it into a noop. Performance at
80 clients shot up to 210k TPS. I also tried inserting an
acquire-and-release cycle on MyProc->backendLock, which was just as
good. That seems like a pretty encouraging result, since there appear
to be several ways of reimplementing SIGetDataEntries() that would
work with a per-backend lock rather than a global one.
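One possible shape for a reader that touches only a per-backend lock in the common "nothing new" case is sketched below. This is a hypothetical illustration using pthread mutexes, not PostgreSQL's sinval code: `InvalQueue`, `queue_insert`, and `queue_read` are invented names, and the queue has no wraparound or reset handling. Writers append under a shared write lock and then set a flag in each backend's slot under that backend's own lock. A reader whose flag is clear returns after touching only its own lock, and it takes the shared lock only when there is something to read.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_BACKENDS 4
#define QUEUE_SIZE 64

typedef struct BackendState {
    pthread_mutex_t lock;   /* per-backend lock protecting has_messages */
    bool has_messages;      /* set by writers, cleared by the owning reader */
    int next_msg;           /* next message to read; touched only by the owner */
} BackendState;

typedef struct InvalQueue {
    pthread_mutex_t write_lock;   /* serializes writers; protects max_msg and msgs */
    int max_msg;                  /* number of messages written so far */
    int msgs[QUEUE_SIZE];
    BackendState backends[MAX_BACKENDS];
} InvalQueue;

void queue_init(InvalQueue *q)
{
    pthread_mutex_init(&q->write_lock, NULL);
    q->max_msg = 0;
    for (int i = 0; i < MAX_BACKENDS; i++) {
        pthread_mutex_init(&q->backends[i].lock, NULL);
        q->backends[i].has_messages = false;
        q->backends[i].next_msg = 0;
    }
}

void queue_insert(InvalQueue *q, int msg)
{
    pthread_mutex_lock(&q->write_lock);
    assert(q->max_msg < QUEUE_SIZE);   /* sketch only: no wraparound/reset logic */
    q->msgs[q->max_msg++] = msg;
    pthread_mutex_unlock(&q->write_lock);

    /* Notify each backend by touching only that backend's lock. */
    for (int i = 0; i < MAX_BACKENDS; i++) {
        BackendState *b = &q->backends[i];
        pthread_mutex_lock(&b->lock);
        b->has_messages = true;
        pthread_mutex_unlock(&b->lock);
    }
}

/* Copy pending messages for backend `id` into `out` (room for QUEUE_SIZE). */
int queue_read(InvalQueue *q, int id, int *out)
{
    BackendState *b = &q->backends[id];
    bool pending;

    /* Fast path: only our own lock, so idle readers never contend. */
    pthread_mutex_lock(&b->lock);
    pending = b->has_messages;
    b->has_messages = false;   /* clear before reading, so later inserts re-set it */
    pthread_mutex_unlock(&b->lock);
    if (!pending)
        return 0;

    /* Slow path: there is something to read, so take the shared lock. */
    pthread_mutex_lock(&q->write_lock);
    int n = 0;
    while (b->next_msg < q->max_msg)
        out[n++] = q->msgs[b->next_msg++];
    pthread_mutex_unlock(&q->write_lock);
    return n;
}
```

Clearing the flag before copying is what keeps messages from being lost: any message appended after the reader's copy sets the flag again, so the next call picks it up. The worst case is a spurious slow-path trip that returns zero messages.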
I did some other experiments, too. In the hopes of finding a general
way to reduce spinlock contention, I implemented a set of "elimination
buffers", where backends that can't get a spinlock go and try to
combine their requests and then send a designated representative to
acquire the spinlock and acquire shared locks simultaneously for all
group members. It took me a while to debug the code, and I still
can't get it to do anything other than reduce performance, which may
mean that I haven't found all the bugs yet, but I'm not feeling
encouraged. Some poking around suggests that the problem isn't that
spinlocks are routinely contended - it seems that we nearly always get
the spinlock right off the bat. I'm wondering if the problem may be
not so much that we have continuous spinlock contention, but rather
that every once in a while a process gets time-sliced out while it
holds a spinlock. If it's an important spinlock (like the one
protecting SInvalReadLock), the system will quickly evolve into a
state where every single processor is doing nothing but trying to get
that spinlock. Even after the hapless lock-holder gets to run again
and lets go of the lock, you have a whole pile of other backends who
are sitting there firing off lock xchgb in a tight loop, and they can
only get it one at a time, so you have ferocious cache line contention
until the backlog clears. Then things are OK again for a bit until
the same thing happens to some other backend. This is just a theory;
I might be totally wrong...
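The pile-up described in the theory, with many backends executing a locked exchange in a tight loop, is the textbook case for a test-and-test-and-set spinlock. The sketch below uses C11 atomics and is not PostgreSQL's s_lock implementation; `TtasLock`, `ttas_acquire`, `ttas_release`, and `run_counter_demo` are hypothetical names. Waiters spin on a plain load, which is served from their own cached copy of the line, and only issue `atomic_exchange` (an `xchg` on x86) once the lock looks free, so the line bounces between caches mainly when the lock actually changes hands.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool locked;
} TtasLock;

static void ttas_init(TtasLock *l)
{
    atomic_init(&l->locked, false);
}

static void ttas_acquire(TtasLock *l)
{
    for (;;) {
        /* Test: read-only spin; the line stays shared in each waiter's cache. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        /* Test-and-set: attempt the atomic exchange only when the lock looks free. */
        if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
            return;
    }
}

static void ttas_release(TtasLock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}

/* Demo: several threads increment a shared counter under the lock. */
typedef struct {
    TtasLock *lock;
    long *counter;
    int iters;
} DemoArgs;

static void *demo_worker(void *arg)
{
    DemoArgs *a = arg;
    for (int i = 0; i < a->iters; i++) {
        ttas_acquire(a->lock);
        (*a->counter)++;
        ttas_release(a->lock);
    }
    return NULL;
}

long run_counter_demo(int nthreads, int iters)
{
    TtasLock lock;
    long counter = 0;
    pthread_t tids[16];
    DemoArgs args = { &lock, &counter, iters };

    assert(nthreads <= 16);
    ttas_init(&lock);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, demo_worker, &args);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    return counter;
}
```

This only reduces coherence traffic among waiters. It does nothing about the time-slicing scenario itself: if the holder is descheduled, waiters still spin until it runs again, and when it releases the lock they all race for the exchange at once.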
The Enterprise PostgreSQL Company