Re: Spinlock performance improvement proposal

From: Neil Padgett <npadgett(at)redhat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spinlock performance improvement proposal
Date: 2001-09-26 18:46:16
Message-ID: 3BB22278.5F5F37DF@redhat.com
Lists: pgsql-hackers

Tom Lane wrote:
>
> At the just-past OSDN database conference, Bruce and I were annoyed by
> some benchmark results showing that Postgres performed poorly on an
> 8-way SMP machine. Based on past discussion, it seems likely that the
> culprit is the known inefficiency in our spinlock implementation.
> After chewing on it for awhile, we came up with an idea for a solution.
>
> The following proposal should improve performance substantially when
> there is contention for a lock, but it creates no portability risks
> because it uses the same system facilities (TAS and SysV semaphores)
> that we have always relied on. Also, I think it'd be fairly easy to
> implement --- I could probably get it done in a day.
>
> Comments anyone?

We have been doing some scalability testing recently here at Red
Hat. The machine I was using was a 4-way 550 MHz Xeon SMP machine; I
also ran it in uniprocessor mode to make some comparisons. All
runs were made on Red Hat Linux running 2.4.x series kernels. I've
examined a number of potentially interesting cases -- I'm still
analyzing the data, but some of the initial results may be of
interest:

- We have tried benchmarking the following: TAS spinlocks (existing
implementation), SysV semaphores (existing implementation), Pthread
Mutexes. Pgbench runs were conducted for 1 to 512 simultaneous backends.

For these three cases we found:
- TAS spinlocks fared the best of the three lock types; however, above
100 clients the Pthread mutexes were in lock step with them in
performance. I expect this is because the cost of any system calls
becomes negligible relative to lock wait time. (A sketch of how the
Pthread mutex variant can be plugged in follows below.)
- The SysV semaphore implementation fared terribly, as expected.
However, it fared worse relative to the TAS spinlocks on SMP than on
uniprocessor.
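
For reference, substituting a Pthread mutex for the TAS spinlock only
works if the mutex is process-shared and lives in shared memory, since
backends are separate processes. A minimal sketch of that kind of
substitution (the pg_* names here are purely illustrative, not the
code we actually benchmarked):

#include <pthread.h>

typedef pthread_mutex_t pg_pthread_slock_t;     /* illustrative name */

static void
pg_pthread_slock_init(pg_pthread_slock_t *lock)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    /* Backends are forked processes, so the mutex must be marked
     * process-shared, and the lock itself must be in shared memory. */
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

#define PG_S_LOCK(lock)     pthread_mutex_lock(lock)
#define PG_S_UNLOCK(lock)   pthread_mutex_unlock(lock)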

- Since the above seemed to indicate that the lock implementation may
not be the problem (Pthread mutexes are supposed to be implemented to be
less bang-bang than the Postgres TAS spinlocks, IIRC), I decided to
profile Postgres. After much trouble, I got results for it using
oprofile, a kernel profiler for Linux. Unfortunately, I can only profile
for uniprocessor right now using oprofile, as it doesn't support SMP
boxes yet. (soon, I hope.)
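
(For context on the "bang-bang" comment: the existing spinlock code
simply retries the hardware test-and-set, sleeping via select() between
some of the retries rather than blocking until the lock is free. A
greatly simplified sketch, reusing the slock_t / TAS / s_lock_sleep
names from s_lock.h and omitting the stuck-lock timeout handled by the
real code in src/backend/storage/lmgr/s_lock.c:)

static void
spin_sketch(volatile slock_t *lock)
{
    unsigned int spins = 0;

    while (TAS(lock))           /* test-and-set; nonzero means lock busy */
        s_lock_sleep(spins++);  /* back off with select()-based delays */
}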

Initial results (top consumers -- if you would like the complete
profile, let me know):
Each sample counts as 1 samples.
  %   cumulative    self              self     total
 time   samples    samples    calls  T1/call  T1/call  name
26.57   42255.02  42255.02                             FindLockCycleRecurse
 5.55   51081.02   8826.00                             s_lock_sleep
 5.07   59145.03   8064.00                             heapgettup
 4.48   66274.03   7129.00                             hash_search
 4.48   73397.03   7123.00                             s_lock
 2.85   77926.03   4529.00                             HeapTupleSatisfiesSnapshot
 2.07   81217.04   3291.00                             SHMQueueNext
 1.85   84154.04   2937.00                             AllocSetAlloc
 1.84   87085.04   2931.00                             fmgr_isbuiltin
 1.64   89696.04   2611.00                             set_ps_display
 1.51   92101.04   2405.00                             FunctionCall2
 1.47   94442.04   2341.00                             XLogInsert
 1.39   96649.04   2207.00                             _bt_compare
 1.22   98597.04   1948.00                             SpinAcquire
 1.22  100544.04   1947.00                             LockBuffer
 1.21  102469.04   1925.00                             tag_hash
 1.01  104078.05   1609.00                             LockAcquire
  ...

(The samples are proportional to execution time.)

This would seem to point to the deadlock detector. (Which some have
fingered as a possible culprit before, IIRC.)

However, this seems to be a red herring. Removing the deadlock detector
had no effect: benchmarking showed no improvement in transaction
processing rate on either uniprocessor or SMP systems. Instead, it
seems that the deadlock detector simply amounts to "something to do"
for a blocked backend while it waits for lock acquisition.

Profiling bears this out:

Flat profile:

Each sample counts as 1 samples.
  %   cumulative    self              self     total
 time   samples    samples    calls  T1/call  T1/call  name
12.38   14112.01  14112.01                             s_lock_sleep
10.18   25710.01  11598.01                             s_lock
 6.47   33079.01   7369.00                             hash_search
 5.88   39784.02   6705.00                             heapgettup
 5.32   45843.02   6059.00                             HeapTupleSatisfiesSnapshot
 2.62   48830.02   2987.00                             AllocSetAlloc
 2.48   51654.02   2824.00                             fmgr_isbuiltin
 1.89   53813.02   2159.00                             XLogInsert
 1.86   55938.02   2125.00                             _bt_compare
 1.72   57893.03   1955.00                             SpinAcquire
 1.61   59733.03   1840.00                             LockBuffer
 1.60   61560.03   1827.00                             FunctionCall2
 1.56   63339.03   1779.00                             tag_hash
 1.46   65007.03   1668.00                             set_ps_display
 1.20   66372.03   1365.00                             SearchCatCache
 1.14   67666.03   1294.00                             LockAcquire
  ...

Our current suspicion is that the lock implementation isn't the only
problem (though there is certainly room for improvement), and perhaps
isn't even the main problem. For example, there has been some
suggestion that some component of the database is causing heavy lock
contention. My opinion is that rather than guessing and taking stabs in
the dark, we need to take a more reasoned approach to these things.
IMHO, the next step should be to apply instrumentation (likely via some
neat macros) to all lock acquires / releases. Then it will be possible
to determine which components are the greatest consumers of locks, and
whether we are looking at a component problem or a systemic one (i.e.
some specific component vs. the lock implementation itself). A rough
sketch of the sort of instrumentation I have in mind follows.
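
Purely as an illustration of what I mean (nothing here is written yet),
one could wrap the existing entry points roughly like this; the
lockstats_lookup() helper and the per-call-site counters are
hypothetical:

#include <sys/time.h>

typedef struct LockStats
{
    const char *file;           /* call site recorded via __FILE__ */
    int         line;           /* ... and __LINE__ */
    long        acquires;       /* acquisitions from this site */
    long        usec_waited;    /* total time spent in SpinAcquire() */
} LockStats;

/* hypothetical: find-or-create the stats slot for a call site */
extern LockStats *lockstats_lookup(const char *file, int line);

#define SpinAcquire_Instrumented(lockid) \
    do { \
        LockStats *ls_ = lockstats_lookup(__FILE__, __LINE__); \
        struct timeval t0_, t1_; \
        gettimeofday(&t0_, NULL); \
        SpinAcquire(lockid); \
        gettimeofday(&t1_, NULL); \
        ls_->acquires++; \
        ls_->usec_waited += (t1_.tv_sec - t0_.tv_sec) * 1000000L \
                          + (t1_.tv_usec - t0_.tv_usec); \
    } while (0)

Dumping the LockStats table at backend exit would then show which call
sites (and hence which components) spend the most time waiting on
locks.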

Neil

--
Neil Padgett
Red Hat Canada Ltd. E-Mail: npadgett(at)redhat(dot)com
2323 Yonge Street, Suite #300,
Toronto, ON M4P 2C9
