
Re: spinlocks on HP-UX

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: spinlocks on HP-UX
Date: 2011-08-28 23:19:57
Message-ID: 22039.1314573597@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Yeah, I figured out that was probably what you meant a little while
> later.  I found a 64-CPU IA64 machine in Red Hat's test labs and am
> currently trying to replicate your results; report to follow.

OK, these results are on a 64-processor SGI IA64 machine (AFAICT, 64
independent sockets, no hyperthreading or any funny business); 124GB
in 32 NUMA nodes; running RHEL5.7, gcc 4.1.2.  I built today's git
head with --enable-debug (but not --enable-cassert) and ran with all
default configuration settings except shared_buffers = 8GB and
max_connections = 200.  The test database is initialized at -s 100.
I did not change the database between runs, but restarted the postmaster
and then did this to warm the caches a tad:

pgbench -c 1 -j 1 -S -T 30 bench

Per-run pgbench parameters are as shown below --- note in particular
that I assigned one pgbench thread per 8 backends.

The numbers are fairly variable even with 5-minute runs; I did each
series twice so you could get a feeling for how much.

Today's git head:

pgbench -c 1 -j 1 -S -T 300 bench	tps = 5835.213934 (including ...
pgbench -c 2 -j 1 -S -T 300 bench	tps = 8499.223161 (including ...
pgbench -c 8 -j 1 -S -T 300 bench	tps = 15197.126952 (including ...
pgbench -c 16 -j 2 -S -T 300 bench	tps = 30803.255561 (including ...
pgbench -c 32 -j 4 -S -T 300 bench	tps = 65795.356797 (including ...
pgbench -c 64 -j 8 -S -T 300 bench	tps = 81644.914241 (including ...
pgbench -c 96 -j 12 -S -T 300 bench	tps = 40059.202836 (including ...
pgbench -c 128 -j 16 -S -T 300 bench	tps = 21309.615001 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench	tps = 5787.310115 (including ...
pgbench -c 2 -j 1 -S -T 300 bench	tps = 8747.104236 (including ...
pgbench -c 8 -j 1 -S -T 300 bench	tps = 14655.369995 (including ...
pgbench -c 16 -j 2 -S -T 300 bench	tps = 28287.254924 (including ...
pgbench -c 32 -j 4 -S -T 300 bench	tps = 61614.715187 (including ...
pgbench -c 64 -j 8 -S -T 300 bench	tps = 79754.640518 (including ...
pgbench -c 96 -j 12 -S -T 300 bench	tps = 40334.994324 (including ...
pgbench -c 128 -j 16 -S -T 300 bench	tps = 23285.271257 (including ...

With modified TAS macro (see patch 1 below):

pgbench -c 1 -j 1 -S -T 300 bench	tps = 6171.454468 (including ...
pgbench -c 2 -j 1 -S -T 300 bench	tps = 8709.003728 (including ...
pgbench -c 8 -j 1 -S -T 300 bench	tps = 14902.731035 (including ...
pgbench -c 16 -j 2 -S -T 300 bench	tps = 29789.744482 (including ...
pgbench -c 32 -j 4 -S -T 300 bench	tps = 59991.549128 (including ...
pgbench -c 64 -j 8 -S -T 300 bench	tps = 117369.287466 (including ...
pgbench -c 96 -j 12 -S -T 300 bench	tps = 112583.144495 (including ...
pgbench -c 128 -j 16 -S -T 300 bench	tps = 110231.305282 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench	tps = 5670.097936 (including ...
pgbench -c 2 -j 1 -S -T 300 bench	tps = 8230.786940 (including ...
pgbench -c 8 -j 1 -S -T 300 bench	tps = 14785.952481 (including ...
pgbench -c 16 -j 2 -S -T 300 bench	tps = 29335.875139 (including ...
pgbench -c 32 -j 4 -S -T 300 bench	tps = 59605.433837 (including ...
pgbench -c 64 -j 8 -S -T 300 bench	tps = 108884.294519 (including ...
pgbench -c 96 -j 12 -S -T 300 bench	tps = 110387.439978 (including ...
pgbench -c 128 -j 16 -S -T 300 bench	tps = 109046.121191 (including ...

With unlocked test in s_lock.c delay loop only (patch 2 below):

pgbench -c 1 -j 1 -S -T 300 bench	tps = 5426.491088 (including ...
pgbench -c 2 -j 1 -S -T 300 bench	tps = 8787.939425 (including ...
pgbench -c 8 -j 1 -S -T 300 bench	tps = 15720.801359 (including ...
pgbench -c 16 -j 2 -S -T 300 bench	tps = 33711.102718 (including ...
pgbench -c 32 -j 4 -S -T 300 bench	tps = 61829.180234 (including ...
pgbench -c 64 -j 8 -S -T 300 bench	tps = 109781.655020 (including ...
pgbench -c 96 -j 12 -S -T 300 bench	tps = 107132.848280 (including ...
pgbench -c 128 -j 16 -S -T 300 bench	tps = 106533.630986 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench	tps = 5705.283316 (including ...
pgbench -c 2 -j 1 -S -T 300 bench	tps = 8442.798662 (including ...
pgbench -c 8 -j 1 -S -T 300 bench	tps = 14423.723837 (including ...
pgbench -c 16 -j 2 -S -T 300 bench	tps = 29112.751995 (including ...
pgbench -c 32 -j 4 -S -T 300 bench	tps = 62258.984033 (including ...
pgbench -c 64 -j 8 -S -T 300 bench	tps = 107741.988800 (including ...
pgbench -c 96 -j 12 -S -T 300 bench	tps = 107138.968981 (including ...
pgbench -c 128 -j 16 -S -T 300 bench	tps = 106110.215138 (including ...

So this pretty well confirms Robert's results, in particular that all of
the win from an unlocked test comes from using it in the delay loop.
Given the lack of evidence that a general change in TAS() is beneficial,
I'm inclined to vote against it, on the grounds that the extra test is
surely a loss at some level when there is no contention.
(IOW, +1 for inventing a second macro to use in the delay loop only.)

We ought to do similar tests on other architectures.  I found some
lots-o-processors x86_64 machines at Red Hat, but they don't seem to
own any PPC systems with more than 8 processors.  Anybody have big
iron with other non-Intel chips?

			regards, tom lane


Patch 1: change TAS globally, non-HPUX code:

*** src/include/storage/s_lock.h.orig	Sat Jan  1 13:27:24 2011
--- src/include/storage/s_lock.h	Sun Aug 28 13:32:47 2011
***************
*** 228,233 ****
--- 228,240 ----
  {
  	long int	ret;
  
+ 	/*
+ 	 * Use a non-locking test before the locking instruction proper.  This
+ 	 * appears to be a very significant win on many-core IA64.
+ 	 */
+ 	if (*lock)
+ 		return 1;
+ 
  	__asm__ __volatile__(
  		"	xchg4 	%0=%1,%2	\n"
  :		"=r"(ret), "+m"(*lock)
***************
*** 243,248 ****
--- 250,262 ----
  {
  	int		ret;
  
+ 	/*
+ 	 * Use a non-locking test before the locking instruction proper.  This
+ 	 * appears to be a very significant win on many-core IA64.
+ 	 */
+ 	if (*lock)
+ 		return 1;
+ 
  	ret = _InterlockedExchange(lock,1);	/* this is a xchg asm macro */
  
  	return ret;

Patch 2: change s_lock only (same as Robert's quick hack):

*** src/backend/storage/lmgr/s_lock.c.orig	Sat Jan  1 13:27:09 2011
--- src/backend/storage/lmgr/s_lock.c	Sun Aug 28 14:02:29 2011
***************
*** 96,102 ****
  	int			delays = 0;
  	int			cur_delay = 0;
  
! 	while (TAS(lock))
  	{
  		/* CPU-specific delay each time through the loop */
  		SPIN_DELAY();
--- 96,102 ----
  	int			delays = 0;
  	int			cur_delay = 0;
  
! 	while (*lock ? 1 : TAS(lock))
  	{
  		/* CPU-specific delay each time through the loop */
  		SPIN_DELAY();
