spinlocks on HP-UX

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: spinlocks on HP-UX
Date: 2011-08-28 00:09:08
Message-ID: CA+TgmoZvATZV+eLh3U35jaNnwwzLL5ewUU_-t0X=T0Qwas+ZdA@mail.gmail.com
Lists: pgsql-hackers

I was able to obtain access to a 32-core HP-UX server. I repeated the
pgbench -S testing that I have previously done on Linux, and found
that the results were not too good. Here are the results at scale
factor 100, on 9.2devel, with various numbers of clients. Five minute
runs, shared_buffers=8GB.

1:tps = 5590.070816 (including connections establishing)
8:tps = 37660.233932 (including connections establishing)
16:tps = 67366.099286 (including connections establishing)
32:tps = 82781.624665 (including connections establishing)
48:tps = 18589.995074 (including connections establishing)
64:tps = 16424.661371 (including connections establishing)

And just for comparison, here are the numbers at scale factor 1000:

1:tps = 4751.768608 (including connections establishing)
8:tps = 33621.474490 (including connections establishing)
16:tps = 58959.043171 (including connections establishing)
32:tps = 78801.265189 (including connections establishing)
48:tps = 21635.234969 (including connections establishing)
64:tps = 18611.863567 (including connections establishing)

After mulling over the vmstat output for a bit, I began to suspect
spinlock contention. I took a look at a document called "Implementing
Spinlocks on the Intel Itanium Architecture and PA-RISC", by Tor
Ekqvist and David Graves and available via the HP web site, which
states that when spinning on a spinlock on these machines, you should
use a regular, unlocked test first and use the atomic test only when
the unlocked test looks OK. I tried implementing this in two ways,
and both produced results which are FAR superior to our current
implementation. First, I did this:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -726,7 +726,7 @@ tas(volatile slock_t *lock)
 typedef unsigned int slock_t;

 #include <ia64/sys/inline.h>
-#define TAS(lock) _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE)
+#define TAS(lock) (*(lock) ? 1 : _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE))

 #endif /* HPUX on IA64, non gcc */

That resulted in these numbers. Scale factor 100:

1:tps = 5569.911714 (including connections establishing)
8:tps = 37365.364468 (including connections establishing)
16:tps = 63596.261875 (including connections establishing)
32:tps = 95948.157678 (including connections establishing)
48:tps = 90708.253920 (including connections establishing)
64:tps = 100109.065744 (including connections establishing)

Scale factor 1000:

1:tps = 4878.332996 (including connections establishing)
8:tps = 33245.469907 (including connections establishing)
16:tps = 56708.424880 (including connections establishing)
48:tps = 69652.232635 (including connections establishing)
64:tps = 70593.208637 (including connections establishing)
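
For anyone without the HP compiler at hand, the idea behind that TAS()
change can be sketched in portable C11. This is only an illustration
under stated assumptions, not the patch itself: the demo_* names are
made up, and C11 atomics stand in for the _Asm_xchg intrinsic. The
point is that a waiter does a cheap ordinary load first and attempts
the atomic exchange only when the lock looks free.

```c
#include <stdatomic.h>

/* Illustrative lock type; this is NOT PostgreSQL's slock_t. */
typedef atomic_uint demo_slock_t;

/* Plain atomic exchange: returns nonzero if the lock was already held. */
static inline unsigned demo_tas(demo_slock_t *lock)
{
    return atomic_exchange_explicit(lock, 1u, memory_order_acquire);
}

/*
 * Test-and-test-and-set: do a relaxed load first and only attempt the
 * atomic exchange when the lock looks free.
 */
static inline unsigned demo_ttas(demo_slock_t *lock)
{
    if (atomic_load_explicit(lock, memory_order_relaxed) != 0)
        return 1;               /* looks held: report failure, no exchange */
    return demo_tas(lock);
}

/* Spin until the lock is acquired. */
static inline void demo_lock(demo_slock_t *lock)
{
    while (demo_ttas(lock))
        ;                       /* real code would delay or back off here */
}

/* Release the lock. */
static inline void demo_unlock(demo_slock_t *lock)
{
    atomic_store_explicit(lock, 0u, memory_order_release);
}
```

Whether the extra load helps or hurts depends on the hardware, so a
sketch like this says nothing about performance on any given machine.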

Then, I did this:

--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -96,7 +96,7 @@ s_lock(volatile slock_t *lock, const char *file, int line)
     int delays = 0;
     int cur_delay = 0;

-    while (TAS(lock))
+    while (*lock ? 1 : TAS(lock))
     {
         /* CPU-specific delay each time through the loop */
         SPIN_DELAY();

That resulted in these numbers, at scale factor 100:

1:tps = 5564.059494 (including connections establishing)
8:tps = 37487.090798 (including connections establishing)
16:tps = 66061.524760 (including connections establishing)
32:tps = 96535.523905 (including connections establishing)
48:tps = 92031.618360 (including connections establishing)
64:tps = 106813.631701 (including connections establishing)

And at scale factor 1000:

1:tps = 4980.338246 (including connections establishing)
8:tps = 33576.680072 (including connections establishing)
16:tps = 55618.677975 (including connections establishing)
32:tps = 73589.442746 (including connections establishing)
48:tps = 70987.026228 (including connections establishing)

Not sure why I am missing the 64-client results for that last set of
tests, but no matter.

Of course, we can't apply the second patch as it stands, because I
tested it on x86 and it loses. But it seems pretty clear we need to
do it at least for this architecture...
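
One way to confine the pre-test to the architectures where it wins is
a separate macro that is used only while waiting and that defaults to
plain TAS. The following is a rough sketch of that idea, not a
proposed patch: DEMO_TAS, DEMO_TAS_SPIN, DEMO_PRETEST_WHEN_SPINNING
and demo_acquire are all hypothetical names, and C11 atomics stand in
for the platform-specific TAS implementations.

```c
#include <stdatomic.h>

/* Illustrative lock type; this is NOT PostgreSQL's slock_t. */
typedef atomic_uint demo_slock_t;

/* Unconditional atomic test-and-set: nonzero means the lock was held. */
#define DEMO_TAS(lock) \
    atomic_exchange_explicit((lock), 1u, memory_order_acquire)

/*
 * DEMO_TAS_SPIN is a hypothetical hook used only while waiting.
 * DEMO_PRETEST_WHEN_SPINNING is a made-up build flag that a platform
 * header could define where the unlocked pre-test is a win.
 */
#ifdef DEMO_PRETEST_WHEN_SPINNING
#define DEMO_TAS_SPIN(lock) \
    (atomic_load_explicit((lock), memory_order_relaxed) ? 1u : DEMO_TAS(lock))
#else
#define DEMO_TAS_SPIN(lock) DEMO_TAS(lock)
#endif

/* Acquire the lock; returns the number of failed attempts. */
static inline unsigned long demo_acquire(demo_slock_t *lock)
{
    unsigned long spins = 0;

    if (DEMO_TAS(lock) == 0)    /* fast path: a single atomic attempt */
        return 0;
    do {
        spins++;                /* real code would delay or back off here */
    } while (DEMO_TAS_SPIN(lock));
    return spins;
}
```

With this split, platforms that do not define the flag keep exactly
the unconditional behavior, and the uncontended path never pays for
the extra load on any platform.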

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
