Re: High SYS CPU - need advise

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Vlad <marchenko(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: High SYS CPU - need advise
Date: 2012-11-15 22:16:06
Message-ID: CAHyXU0y4LCV3uTBZp9+5iX-WL5fSQVmrKeQtaGqTPQtVJ7V_DA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Thu, Nov 15, 2012 at 3:49 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Thu, Nov 15, 2012 at 2:44 PM, Vlad <marchenko(at)gmail(dot)com> wrote:
>>
>>>
>>> yeah. ok, nest steps:
>>> *) can you confirm that postgres process is using high cpu (according
>>> to top) during stall time
>>
>>
>> yes, CPU is spread across a lot of postmasters
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 29863 pgsql 20 0 3636m 102m 36m R 19.1 0.3 0:01.33 postmaster
>> 30277 pgsql 20 0 3645m 111m 37m R 16.8 0.3 0:01.27 postmaster
>> 11966 pgsql 20 0 3568m 22m 15m R 15.1 0.1 0:00.66 postmaster
>> 8073 pgsql 20 0 3602m 60m 26m S 13.6 0.2 0:00.77 postmaster
>> 29780 pgsql 20 0 3646m 115m 43m R 13.6 0.4 0:01.13 postmaster
>> 11865 pgsql 20 0 3606m 61m 23m S 12.8 0.2 0:01.87 postmaster
>> 29379 pgsql 20 0 3603m 70m 30m R 12.8 0.2 0:00.80 postmaster
>> 29727 pgsql 20 0 3616m 77m 31m R 12.5 0.2 0:00.81 postmaster
>>
>>>
>>> *) if, so, please strace that process and save some of the log
>>
>>
>> https://dl.dropbox.com/u/109778/stall_postmaster.log
>
> ok, excellent. reviewing the log, this immediately caught my eye:
>
> recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
> recvfrom(8, "\327\327\nl\231LD\211\346\243(at)WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
> 64, 0, NULL, NULL) = 64
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
> semop(41713721, {{2, 1, 0}}, 1) = 0
> lseek(295, 0, SEEK_END) = 0
> lseek(296, 0, SEEK_END) = 8192
>
> this is definitely pointing to spinlock issue. see: slock.c what you
> are seeing here is a backend getting caught in a spin loop via a stuck
> spinlock. 0-9 = 10 times. did you by any chance check the logs? did
> any backends restart? we are looking for this:
>
> elog(PANIC, "stuck spinlock (%p) detected at %s:%d",
> lock, file, line);
>
> Anything else going on in the log? The way to enable locks status is
> via LWLOCK_STATS macro. But before doing any thing plz check logs for
> error/panic.

ah, scratch that. reading the code again, slock.c allows for for up
to 1 second (that is, you can delay 1000 times). so the panic is
unlikely to be in the log. also spinlock waits shouldn't increase by
exactly 1000 us. investigating...

merlin

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Alvaro Herrera 2012-11-15 22:29:50 Re: High SYS CPU - need advise
Previous Message Merlin Moncure 2012-11-15 21:49:42 Re: High SYS CPU - need advise