Re: High SYS CPU - need advise

From: Vlad Marchenko <marchenko(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: High SYS CPU - need advise
Date: 2012-11-21 15:29:01
Message-ID: 50ACF33D.20609@gmail.com
Lists: pgsql-general

update on my problem: despite using pgbouncer, the problem still occurs on
my end.

Also, an interesting observation - I ran several tests with pgbench, using
queries that I think are prone to triggering the high-sys-cpu stall. What I
noticed is that when pgbench is run in prepared mode, the system behaves
fine during the stress test (high user cpu - 85-90%, low sys cpu - 5-7%)
and TPS stays high. But when I use the non-prepared modes, the sys cpu
portion jumps to 40% (and tps drops dramatically as well, though that is
understandable). The test queries are pretty long (10kb+), with a couple of
outer joins across indexed 1000-record tables.
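
For reference, the runs were along these lines - a rough sketch rather than
the exact commands (the client count, duration, and the script name
"stall_query.sql" are illustrative):

    # prepared mode: high user cpu, low sys cpu, high TPS
    pgbench -n -M prepared -f stall_query.sql -c 32 -j 8 -T 60 mydb

    # simple (non-prepared) mode: sys cpu jumps to ~40%, TPS drops
    pgbench -n -M simple -f stall_query.sql -c 32 -j 8 -T 60 mydb

The only intended difference between the two runs is the protocol mode
(-M), so the prepared run skips the parse/plan work on every execution.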

Maybe we are looking in the wrong place and the issue is somewhere within
the planner/parser? Is there some extensive locking used in there?

Another observation - it's harder to trigger the high-sys-cpu stall on a
freshly restarted postgresql; once it has been running for a while, it
becomes much easier to do.

-- vlad

On 11/19/12 8:33 AM, Merlin Moncure wrote:
> On Sun, Nov 18, 2012 at 4:24 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> On Fri, Nov 16, 2012 at 12:13 PM, Vlad <marchenko(at)gmail(dot)com> wrote:
>>> ok, I've applied that patch and ran. The stall started around 13:50:45...50
>>> and lasted until the end
>>>
>>> https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log
>> That isn't as much log as I expected. But I guess only the tip of the
>> iceberg gets logged; it could be that most spinlocks are contended but
>> then get acquired just before the sleep (and log) would have happened.
>>
>> I'm not sure how to figure out just how much spinning is going on
>> below the surface, but my gut feeling is there isn't enough of it to
>> explain the amount of slowdown you are seeing (and it would probably
>> be accounted for as user time rather than system time)
>
> yes. classic spinlock contention symptoms are high user cpu and almost
> no work getting done. something else is going on here?
>
>> Maybe we could whip up a dummy standalone program that uses PG's
>> spinlock code in an intentionally contended way and see how your
>> system reports resource usage for that. Maybe your system reports
>> fighting for cache lines or whatever is going on behind the scenes as
>> system time rather than user time, or something.
> I don't think this is the case. It's looking more and more like
> scheduler issues, although it's unclear if that is due to some kind of
> regression or just symptoms of backends piling up. I'm also starting
> to wonder if the spinlock contention we are seeing (which according to
> your patch really isn't all that much) is just symptomatic of
> scheduler issues.
>
>> It would be a little easier to figure out what was going on there if
>> the log included the pid (%p) so we can see if it is the same process
>> doing a bunch of consecutive sleeps, or a bunch of different
>> processes. Also, logging the millisecond (%m rather than %t) would be
>> helpful.
>>
>> But the way that the point of contention jumps around a lot points to
>> a scheduler issue rather than a postgres-specific issue. A backend
>> gets pre-empted while holding a spinlock, and then the kernel is
>> resistant to rescheduling it even though all other processes are piled
>> up behind it. But again, that theory would lead to more log lines
>> than we see, I would think.
> yeah. I asked for some longer strace logs but then everyone went home
> for the weekend.
>
>> Anyway, pgbouncer might help by ensuring that a connection gets
>> pre-empted by pgbouncer at a point where it has no locks, rather than
>> by the kernel at a point where it does hold locks.
> yes -- I think pgbouncer is the key here because it's going to tell us
> if the scheduler is the culprit. If you look carefully at the
> strace logs, you'll see that system call times become unpredictable as
> they are getting scheduled out...particularly yielding calls like
> select(). So we need to determine if this is normal high load
> behavior or if there is an underlying mechanism.
>
>> Any chance you could try an older kernel and see if that reduces the problem?
> This is one thing to try. Personally I think pgbouncer/transaction
> mode is low hanging fruit if it 'drops in' to OP's app. If the problem
> goes away this will pretty much point the finger at Linux AFAICT (in
> which case we can start messing around with the kernel).
>
> merlin
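
p.s. re Jeff's point about logging the pid and the millisecond timestamps:
an example log_line_prefix covering both (shown only to illustrate the two
escapes, not the setting currently in use here) would be

    log_line_prefix = '%m [%p] '

where %m is the timestamp with milliseconds and %p is the backend pid, so
consecutive s_lock sleeps from the same process become easy to spot.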
