From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Nils Goroll <slink(at)schokola(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Update on the spinlock->pthread_mutex patch experimental: replace s_lock spinlock code with pthread_mutex on linux
Date: 2012-07-01 23:07:02
Message-ID: CAMkU=1wJpgDz4Zj0+N+ZFX4B2q5aPULiaTMgeNbaadpafmKtqA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jul 1, 2012 at 2:28 PM, Nils Goroll <slink(at)schokola(dot)de> wrote:
> Hi Jeff,
>
>>>> It looks like the hacked code is slower than the original.  That
>>>> doesn't seem so good to me.  Am I misreading this?
>>>
>>> No, you are right - in a way. This is not about maximizing tps, this is about
>>> maximizing efficiency under load situations
>>
>> But why wouldn't this maximized efficiency present itself as increased TPS?
>
> Because the latency of lock acquisition influences TPS, but this is only marginally
> related to the cost in terms of CPU cycles to acquire the locks.
>
> See my posting as of Sun, 01 Jul 2012 21:02:05 +0200 for an overview of my
> understanding.

I still don't see how improving that could fail to improve TPS. But let's
focus on reproducing the problem first; otherwise it is all just
talking in the dark.

> But I don't understand yet how to best provoke high spinlock concurrency with
> pgbench. Or are there any other test tools out there for this case?

Use pgbench -S, or apply my patch from "pgbench--new transaction type"
and then run pgbench -P.

Make sure that the scale is such that all of your data fits in
shared_buffers (I find that on 64-bit, pgbench takes about 15MB * scale).
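
For example, something along these lines (a rough sketch; the shared_buffers
value, the scale, and the "bench" database name are only illustrative, and -P
requires the pgbench patch mentioned above):

    # With shared_buffers = 4GB, pick a scale where 15MB * scale stays
    # comfortably below that, e.g. scale 250 (~3.7GB of pgbench data):
    pgbench -i -s 250 bench

    # Read-only run: 64 clients, 64 threads, 60 seconds.
    # (-S works with stock pgbench; -P is the new transaction type from
    #  the patch and needs it applied.)
    pgbench -S -c 64 -j 64 -T 60 bench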

>> Anyway, your current benchmark speed of around 600 TPS over such
>> short time periods suggests you are limited by fsyncs.
>
> Definitely. I described the setup in my initial posting ("why roll-your-own
> s_lock? / improving scalability" - Tue, 26 Jun 2012 19:02:31 +0200)

OK. It looks like several things changed simultaneously. How likely
do you think it is that turning off the write cache caused the
problem?

>
>> pgbench does as long as that is the case. You could turn fsync off,
>> or just change your benchmark to a read-only one like -S, or better
>> the -P option I've been trying to get into pgbench.
>
> I don't like to make assumptions which I haven't validated. The system showing
> the behavior is designed to write to persistent SSD storage in order to reduce
> the risk of data loss by a (BBU) cache failure. Running a test with fsync=off
> would diverge even further from reality.

I'm afraid you can't get much farther from reality than your current
benchmarks already are.

If your goal is to get pgbench closer to being limited by spinlock
contention, then fsync=off, or using -S or -P, will certainly do that.

So if you have high confidence that spinlock contention is really the
problem, fsync=off will get you closer to the thing you want to focus
on, even if it takes you further away from the holistic big-picture
production environment. And since you went to the trouble of making
patches for spinlocks, I assume you are fairly confident that that is
the problem.

If you are not confident that spinlocks are really the problem, then I
agree it would be a mistake to try to craft a simple pgbench run which
focuses in on one tiny area which might not actually be the correct
area. In that case, you would instead want to either create a very
complicated workload that closely simulates your production load (a
huge undertaking) or find a way to capture an oprofile of the
production server while it is actually in distress. Also, it would
help if you could get oprofile to do a call graph so you can see which
call sites the contended spinlocks are coming from (sorry, I don't
know how to do this successfully with oprofile).
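
For what it's worth, something along these lines might work -- untested on my
end, and assuming the legacy opcontrol interface with call-graph support; perf
is an alternative if oprofile doesn't cooperate (<backend_pid> is a
placeholder for a busy backend's pid):

    # oprofile, sampling with a call-graph depth of 10 (needs root)
    opcontrol --init
    opcontrol --setup --no-vmlinux --callgraph=10
    opcontrol --start
    # ... let the server run under load for a while ...
    opcontrol --stop
    opreport --callgraph $(which postgres)

    # or with perf, attaching to one busy backend for 30 seconds
    perf record -g -p <backend_pid> -- sleep 30
    perf report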

>
>> Does your production server have fast fsyncs (BBU) while your test
>> server does not?
>
> No, we're writing directly to SSDs (ref: initial posting).

OK. So it seems like the pgbench workload you are doing is limited
by fsyncs, and the CPU is basically idle because of that limit, while
your real workload needs a much larger amount of processing power per
fsync, so it is closer to both limits at the same time. But, since
the stats you posted were for the normal rather than the distressed
state, maybe I'm way off here.

Anyway, the easiest way to increase pgbench's "CPU per fsync" need
is to turn off fsync or synchronous_commit, or to switch to read-only
queries.
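
For instance (a sketch only; the data directory path and the "bench" database
name are placeholders, and these settings are strictly for a throwaway
benchmark cluster, never production):

    # Relax durability for the benchmark run, either in postgresql.conf or
    # by passing the settings when restarting a disposable cluster:
    pg_ctl -D /path/to/bench/pgdata -o "-c fsync=off -c synchronous_commit=off" restart

    # Or sidestep commits entirely with a read-only workload:
    pgbench -S -c 64 -j 64 -T 60 bench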

>>> 2       54.4s                2          27.18           SELECT ...
>>
>> That is interesting.  Maybe those two queries are hammering everything
>> else to death.
>
> With 64 cores?

Maybe. That is the nature of spinlocks: the more cores you have,
the more other things each one interferes with. Except that the
duration is not long enough to cover the entire run period. But then
again, maybe in the distressed state those same queries did cover the
entire duration. But yeah, now that I think about it, this would not
be my top hypothesis.

>>
>> In other words, how many query-seconds worth of time transpired during
>> the 137 wall seconds?  That would give an estimate of how many
>> simultaneously active connections the production server has.
>
> Sorry, I should have given you the stats from pgFouine:
>
>     Number of unique normalized queries: 507
>     Number of queries: 295,949
>     Total query duration: 8m38s
>     First query: 2012-06-23 14:51:01
>     Last query: 2012-06-23 14:53:17
>     Query peak: 6,532 queries/s at 2012-06-23 14:51:33

A total duration of 518 seconds over 136 seconds of wall time (about
3.8 queries active on average) suggests there is not all that much
concurrent activity going on. But maybe time spent in commit is not
counted by pgFouine? But again, these stats are for the normal state,
not the distressed state.

> Thank you very much, Jeff! The one question remains: Do we really have all we
> need to provoke very high lock contention?

I think you do. (I don't have 64 cores...)

Lots of cores, running pgbench -c64 -j64 -P -T60 on a scale that fits
in shared_buffers.

Cheers,

Jeff
