Update on the spinlock->pthread_mutex patch experimental: replace s_lock spinlock code with pthread_mutex on linux

From: Nils Goroll <slink(at)schokola(dot)de>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Update on the spinlock->pthread_mutex patch experimental: replace s_lock spinlock code with pthread_mutex on linux
Date: 2012-06-29 17:07:11
Message-ID: 4FEDE0BF.7080203@schokola.de
Lists: pgsql-hackers

Hi,

I'll reply to Jeff, with a brief thank you to Robert, at the bottom.

First of all, here's an update:

I have slightly modified the patch; I'll attach what I have at the moment. The
main differences are:

- loops around the pthread_mutex calls: since the locking function signature
returns void at the moment, there is no error handling code in the callers
(so, theoretically, the current code could already loop forever on a spinlock
if you SIGKILL a spinlock holder, which of course you shouldn't do; robust
mutexes would let us avoid this issue).

Retrying is probably the best we can do without implementing error recovery
in all callers (see the sketch after this list).

- ereport(FATAL, ...) instead of assertions, which is really what we should do
(imagine setting PTHREAD_PROCESS_SHARED fails and we still start up anyway)
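
To make these two points concrete, here is a minimal, self-contained sketch of
the approach (not the actual patch; the names shared_mutex_init and
shared_mutex_acquire are made up for this example, and fprintf/abort merely
stands in for ereport(FATAL, ...)):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* stand-in for ereport(FATAL, ...): startup must not continue on failure */
static void
fatal(const char *what, int err)
{
    fprintf(stderr, "FATAL: %s failed with error %d\n", what, err);
    abort();
}

/* initialize a mutex living in shared memory; any failure is fatal */
static void
shared_mutex_init(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc;

    if ((rc = pthread_mutexattr_init(&attr)) != 0)
        fatal("pthread_mutexattr_init", rc);
    /* must not silently fall back to a process-private mutex */
    if ((rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED)) != 0)
        fatal("pthread_mutexattr_setpshared", rc);
    if ((rc = pthread_mutex_init(m, &attr)) != 0)
        fatal("pthread_mutex_init", rc);
    pthread_mutexattr_destroy(&attr);
}

/*
 * Callers expect a void "acquire" that never fails, so without teaching every
 * caller about errors the best we can do is retry; a robust mutex could
 * recover a lock abandoned by a killed holder here.
 */
static void
shared_mutex_acquire(pthread_mutex_t *m)
{
    while (pthread_mutex_lock(m) != 0)
        ;
}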

Some insights:

- I noticed that, for the simple pgbench tests I ran, PTHREAD_MUTEX_ADAPTIVE_NP
yielded worse results than PTHREAD_MUTEX_NORMAL. This is somewhat counter-
intuitive, because _ADAPTIVE is closer to the current spinlock logic, yet
going straight to the syscall appears to be more efficient than spinning a
little first and then syscalling (for the contended case). A minimal sketch of
the attribute setup follows this list.

The increase in usr/sys time for my tests was on the order of 10-20%.

- I also noticed a general issue with linking against libpthread: my
understanding is that this also makes gcc compile the code as reentrant (does
anyone know precisely?), which we don't need - we only need the locking code,
unless we want to roll our own futex implementation (see below).

I am not sure I have really root-caused this, because I have not fully
understood what is going on, but when compiling with LDFLAGS=-lpthread
in the top level Makefile, usr increases by some 10% for my tests.

The code is more efficient when I simply leave out -lpthread; libpthread
gets linked anyway.

- I had a look at futex sample code, for instance
http://locklessinc.com/articles/mutex_cv_futex/ and Ulrich's paper, but I must
say that at this point I don't feel ready to roll our own futex code for this
most critical piece of code. There is simply too much that can go wrong, and
major mistakes are very hard to spot.

I'd very much prefer to use an existing, proven implementation.

At this point, I'd guess pulling in the relevant code from glibc/nptl
would be one of the safest bets, but even this path is risky.
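
For reference, the only difference between the two variants compared in the
first point above is the mutex type attribute; a minimal sketch of the setup
(glibc-specific, since PTHREAD_MUTEX_ADAPTIVE_NP needs _GNU_SOURCE; the
function name init_test_mutex is made up for this example) might look like:

#define _GNU_SOURCE             /* for PTHREAD_MUTEX_ADAPTIVE_NP (glibc) */
#include <pthread.h>

/*
 * PTHREAD_MUTEX_NORMAL goes to the futex syscall as soon as the fast path
 * fails; PTHREAD_MUTEX_ADAPTIVE_NP spins a bounded number of times first,
 * which is closer to the existing s_lock behaviour but performed worse in
 * the tests described above.
 */
static int
init_test_mutex(pthread_mutex_t *m, int adaptive)
{
    pthread_mutexattr_t attr;
    int rc;

    if ((rc = pthread_mutexattr_init(&attr)) != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr,
                                   adaptive ? PTHREAD_MUTEX_ADAPTIVE_NP
                                            : PTHREAD_MUTEX_NORMAL);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

As for the -lpthread question above: whether libpthread ended up linked into
the binary can be checked with something like "ldd ./postgres | grep pthread".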

On benchmarks:

With the same pgbench parameters as before, I ended up with comparable results
for unpatched and patched in terms of resource consumption:

Test setup for both:

for i in {1..10} ; do
    # start from a pristine data directory for every run
    rsync -av --delete /tmp/test_template_data/ /tmp/data/
    /usr/bin/time ./postgres -D /tmp/data -p 55502 & ppid=$!
    pid=$(pgrep -P $ppid)       # postgres itself, i.e. the child of /usr/bin/time
    sleep 15
    ./pgbench -c 256 -t 20 -j 128 -p 55502 postgres
    kill $pid
    wait $ppid
    wait
    while pgrep -f 55502 ; do
        echo procs still running - hm
        sleep 1
    done
done

unpatched (bins postgresql-server-91-9.1.3-1PGDG.rhel6.rpm)

-bash-4.1$ grep elapsed /var/tmp/20120627_noslock_check/orig_code_2_perf
34.55user 20.07system 0:25.63elapsed 213%CPU (0avgtext+0avgdata 1360688maxresident)k
35.26user 19.90system 0:25.38elapsed 217%CPU (0avgtext+0avgdata 1360704maxresident)k
38.04user 21.68system 0:26.24elapsed 227%CPU (0avgtext+0avgdata 1360704maxresident)k
36.72user 21.95system 0:27.21elapsed 215%CPU (0avgtext+0avgdata 1360688maxresident)k
37.19user 22.00system 0:26.44elapsed 223%CPU (0avgtext+0avgdata 1360704maxresident)k
37.88user 22.58system 0:25.70elapsed 235%CPU (0avgtext+0avgdata 1360704maxresident)k
35.70user 20.90system 0:25.63elapsed 220%CPU (0avgtext+0avgdata 1360688maxresident)k
40.24user 21.65system 0:26.02elapsed 237%CPU (0avgtext+0avgdata 1360688maxresident)k
44.93user 22.96system 0:26.38elapsed 257%CPU (0avgtext+0avgdata 1360704maxresident)k
38.10user 21.51system 0:26.66elapsed 223%CPU (0avgtext+0avgdata 1360688maxresident)k
-bash-4.1$ grep elapsed /var/tmp/20120627_noslock_check/orig_code_2_perf | tail -10 | \
    sed 's:[^0-9. ]::g' | awk '{ u+=$1; s+=$2; c++;} END { print "avg " u/c " " s/c; }'
avg 37.861 21.52

patched (based upon modified source rpm of the above)

-bash-4.1$ egrep elapsed /var/tmp/20120627_noslock_check/with_slock_6_nocompile_without_top_-lpthread
42.32user 27.16system 0:28.18elapsed 246%CPU (0avgtext+0avgdata 2003488maxresident)k
39.14user 26.31system 0:27.24elapsed 240%CPU (0avgtext+0avgdata 2003504maxresident)k
38.81user 26.17system 0:26.67elapsed 243%CPU (0avgtext+0avgdata 2003520maxresident)k
41.04user 27.80system 0:29.00elapsed 237%CPU (0avgtext+0avgdata 2003520maxresident)k
35.41user 22.85system 0:27.15elapsed 214%CPU (0avgtext+0avgdata 2003504maxresident)k
32.74user 21.87system 0:25.62elapsed 213%CPU (0avgtext+0avgdata 2003504maxresident)k
35.68user 24.86system 0:27.16elapsed 222%CPU (0avgtext+0avgdata 2003520maxresident)k
32.10user 20.18system 0:27.26elapsed 191%CPU (0avgtext+0avgdata 2003504maxresident)k
31.32user 18.67system 0:26.95elapsed 185%CPU (0avgtext+0avgdata 2003488maxresident)k
29.99user 19.78system 0:32.08elapsed 155%CPU (0avgtext+0avgdata 2003504maxresident)k
-bash-4.1$ egrep elapsed /var/tmp/20120627_noslock_check/with_slock_6_nocompile_without_top_-lpthread | \
    sed 's:[^0-9. ]::g' | awk '{ u+=$1; s+=$2; c++;} END { print "avg " u/c " " s/c; }'
avg 35.855 23.565

Hopefully I will get a chance to run this in production soon, unless I get
feedback from anyone with reasons why I shouldn't do this.

On 06/28/12 05:21 PM, Jeff Janes wrote:

> It looks like the hacked code is slower than the original. That
> doesn't seem so good to me. Am I misreading this?

No, you are right - in a way. This is not about maximizing tps; it is about
maximizing efficiency under load situations which I can't even simulate at the
moment. So what I am looking for is "comparable" resource consumption and
"comparable" tps - but no risk of concurrent spins on locks.

For minimal contention, using pthread_ functions _must_ be slightly slower than
the current s_lock spin code, but they _should_ scale *much* better at high
contention.

The tps values I got for the runs mentioned above are:

## original code
# egrep ^tps orig_code_2_perf | grep excl | tail -10 | tee /dev/tty | \
    awk '{ a+= $3; c++; } END { print a/c; }'
tps = 607.241375 (excluding connections establishing)
tps = 622.255763 (excluding connections establishing)
tps = 615.397928 (excluding connections establishing)
tps = 632.821217 (excluding connections establishing)
tps = 620.415654 (excluding connections establishing)
tps = 611.083542 (excluding connections establishing)
tps = 631.301615 (excluding connections establishing)
tps = 612.337597 (excluding connections establishing)
tps = 606.433209 (excluding connections establishing)
tps = 574.031095 (excluding connections establishing)
613.332

## patched code
# egrep ^tps with_slock_6_nocompile_without_top_-lpthread | grep excl | tail -10 | \
    tee /dev/tty | awk '{ a+= $3; c++; } END { print a/c; }'
tps = 584.761390 (excluding connections establishing)
tps = 620.994437 (excluding connections establishing)
tps = 630.983695 (excluding connections establishing)
tps = 502.116770 (excluding connections establishing)
tps = 595.879789 (excluding connections establishing)
tps = 679.814563 (excluding connections establishing)
tps = 655.053339 (excluding connections establishing)
tps = 603.453768 (excluding connections establishing)
tps = 679.481280 (excluding connections establishing)
tps = 440.999884 (excluding connections establishing)
599.354

> Also, 20 transactions per connection is not enough of a run to make
> any evaluation on.

As you can see I've repeated the tests 10 times. I've tested slight variations
as mentioned above, so I was looking for quick results with acceptable variation.

> How many cores are you testing on?

64 x AMD64 1.6GHz (4x6262HE in one box)

>> Regarding the actual production issue, I did not manage to synthetically provoke
>> the saturation we are seeing in production using pgbench - I could not even get
>> anywhere near the production load.
>
> What metrics/tools are you using to compare the two loads?

We've got cpu + load avg statistics for the old+new machine and compared values
before/after the migration. The user load presumably is comparable and the main
metric is "users complaining" vs. "users happy".

I wish we had a synthetic benchmark close to the actual load, and I hope that
one of the insights from this will be that the customer should have one.

During what I believe is an overload situation with very high lock contention,
the load avg rises well above 300 and usr+sys well above 80%.

The temporary relief was to move some databases off to other machines.
Interestingly, moving away <10% of the load returned the system to a well
behaved state with usr+sys in the order of 20-30%, which is the main reason why
I believe that this must be a negative scalability issue for situations beyond
some saturation point determined by concurrency on locks.

> What is the production load like?

Here's an anonymized excerpt from a pgFouine analysis of 137 seconds worth of
query logs at "average production user load".

Type     Count    Percentage
SELECT   80,217   27.1
INSERT    6,248    2.1
UPDATE   37,159   12.6
DELETE    4,579    1.5
Queries that took up the most time:

Rank  Total duration  Times executed  Avg. duration (s)  Query
1     3m39s           83,667          0.00               COMMIT;
2     54.4s           2               27.18              SELECT ...
3     41.1s           281             0.15               UPDATE ...
4     25.4s           18,960          0.00               UPDATE ...
5     21.9s           ...

The 9th-ranked query is already below 10 seconds total duration.

> Each transaction has to update one of ten pgbench_branch rows, so you
> can't have more than ten transactions productively active at any given
> time, even though you have 768 connections. So you need to jack up
> the pgbench scale, or switch to using -N mode.

Sorry for having omitted that detail. I had initialized pgbench with -i -s 100.

> Also, you should use -M prepared, otherwise you spend more time
> parsing and planning the statements than executing them.

Ah, good point, thank you. As you will have noticed, I don't have years' worth of
background with pgbench yet.
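
For the record, with -M prepared the run above would then be invoked along the
lines of (same scale and connection settings assumed):

./pgbench -M prepared -c 256 -t 20 -j 128 -p 55502 postgres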

On 06/28/12 05:29 PM, Robert Haas wrote:

> FWIW, I kicked off a looong benchmarking run on this a couple of days
> ago on the IBM POWER7 box, testing pgbench -S, regular pgbench, and
> pgbench --unlogged-tables at various client counts with and without
> the patch; three half-hour test runs for each test configuration. It
> should be done tonight and I will post the results once they're in.

Sounds great! I am really curious.

Nils

Attachment Content-Type Size
0001-experimental-use-pthread-mutexes-instead-of-spinloc.patch text/plain 0 bytes
