From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: spinlocks on HP-UX
Date: 2011-08-28 23:19:57
Message-ID: 22039.1314573597@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Yeah, I figured out that was probably what you meant a little while
> later. I found a 64-CPU IA64 machine in Red Hat's test labs and am
> currently trying to replicate your results; report to follow.

OK, these results are on a 64-processor SGI IA64 machine (AFAICT, 64
independent sockets, no hyperthreading or any funny business); 124GB
in 32 NUMA nodes; running RHEL5.7, gcc 4.1.2. I built today's git
head with --enable-debug (but not --enable-cassert) and ran with all
default configuration settings except shared_buffers = 8GB and
max_connections = 200. The test database is initialized at -s 100.
I did not change the database between runs, but restarted the postmaster
and then did this to warm the caches a tad:

pgbench -c 1 -j 1 -S -T 30 bench

Per-run pgbench parameters are as shown below --- note in particular
that I assigned one pgbench thread per 8 backends.

The numbers are fairly variable even with 5-minute runs; I did each
series twice so you could get a feeling for how much.

Today's git head:

pgbench -c 1 -j 1 -S -T 300 bench tps = 5835.213934 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8499.223161 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 15197.126952 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 30803.255561 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 65795.356797 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 81644.914241 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 40059.202836 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 21309.615001 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench tps = 5787.310115 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8747.104236 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 14655.369995 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 28287.254924 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 61614.715187 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 79754.640518 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 40334.994324 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 23285.271257 (including ...

With modified TAS macro (see patch 1 below):

pgbench -c 1 -j 1 -S -T 300 bench tps = 6171.454468 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8709.003728 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 14902.731035 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 29789.744482 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 59991.549128 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 117369.287466 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 112583.144495 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 110231.305282 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench tps = 5670.097936 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8230.786940 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 14785.952481 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 29335.875139 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 59605.433837 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 108884.294519 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 110387.439978 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 109046.121191 (including ...

With unlocked test in s_lock.c delay loop only (patch 2 below):

pgbench -c 1 -j 1 -S -T 300 bench tps = 5426.491088 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8787.939425 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 15720.801359 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 33711.102718 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 61829.180234 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 109781.655020 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 107132.848280 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 106533.630986 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench tps = 5705.283316 (including ...
pgbench -c 2 -j 1 -S -T 300 bench tps = 8442.798662 (including ...
pgbench -c 8 -j 1 -S -T 300 bench tps = 14423.723837 (including ...
pgbench -c 16 -j 2 -S -T 300 bench tps = 29112.751995 (including ...
pgbench -c 32 -j 4 -S -T 300 bench tps = 62258.984033 (including ...
pgbench -c 64 -j 8 -S -T 300 bench tps = 107741.988800 (including ...
pgbench -c 96 -j 12 -S -T 300 bench tps = 107138.968981 (including ...
pgbench -c 128 -j 16 -S -T 300 bench tps = 106110.215138 (including ...

So this pretty well confirms Robert's results, in particular that all of
the win from an unlocked test comes from using it in the delay loop.
Given the lack of evidence that a general change in TAS() is beneficial,
I'm inclined to vote against it, on the grounds that the extra test is
surely a loss at some level when there is no contention.
(IOW, +1 for inventing a second macro to use in the delay loop only.)

We ought to do similar tests on other architectures. I found some
lots-o-processors x86_64 machines at Red Hat, but they don't seem to
own any PPC systems with more than 8 processors. Anybody have big
iron with other non-Intel chips?

regards, tom lane

Patch 1: change TAS globally, non-HPUX code:

*** src/include/storage/s_lock.h.orig Sat Jan 1 13:27:24 2011
--- src/include/storage/s_lock.h Sun Aug 28 13:32:47 2011
***************
*** 228,233 ****
--- 228,240 ----
{
long int ret;

+ /*
+ * Use a non-locking test before the locking instruction proper. This
+ * appears to be a very significant win on many-core IA64.
+ */
+ if (*lock)
+ return 1;
+
__asm__ __volatile__(
" xchg4 %0=%1,%2 \n"
: "=r"(ret), "+m"(*lock)
***************
*** 243,248 ****
--- 250,262 ----
{
int ret;

+ /*
+ * Use a non-locking test before the locking instruction proper. This
+ * appears to be a very significant win on many-core IA64.
+ */
+ if (*lock)
+ return 1;
+
ret = _InterlockedExchange(lock,1); /* this is a xchg asm macro */

return ret;

Patch 2: change s_lock only (same as Robert's quick hack):

*** src/backend/storage/lmgr/s_lock.c.orig Sat Jan 1 13:27:09 2011
--- src/backend/storage/lmgr/s_lock.c Sun Aug 28 14:02:29 2011
***************
*** 96,102 ****
int delays = 0;
int cur_delay = 0;

! while (TAS(lock))
{
/* CPU-specific delay each time through the loop */
SPIN_DELAY();
--- 96,102 ----
int delays = 0;
int cur_delay = 0;

! while (*lock ? 1 : TAS(lock))
{
/* CPU-specific delay each time through the loop */
SPIN_DELAY();
