Re: Postgres Query Plan Live Lock

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: "Pweaver (Paul Weaver)" <pweaver(at)panjiva(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, Michael Vezza <michael(at)panjiva(dot)com>
Subject: Re: Postgres Query Plan Live Lock
Date: 2014-02-11 23:50:38
Message-ID: CAMkU=1ww0JTod6-UYQY2jAYFbA4tdPwYebD8FeUeLrv_kaokOg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Wed, Feb 5, 2014 at 11:47 AM, Pweaver (Paul Weaver)
<pweaver(at)panjiva(dot)com>wrote:

>
> On Wed, Feb 5, 2014 at 9:52 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
>> On Monday, February 3, 2014, Pweaver (Paul Weaver) <pweaver(at)panjiva(dot)com>
>> wrote:
>>
>>> We have been running into a (live lock?) issue on our production
>>> Postgres instance causing queries referencing a particular table to become
>>> extremely slow and our application to lock up.
>>>
>>> This tends to occur on a particular table that gets a lot of queries
>>> against it after a large number of deletes. When this happens, the
>>> following symptoms occur when queries referencing that table are run (even
>>> it we stop the deleting):
>>>
>>
>> What do you mean by "stop the deleting"? Are you pausing the delete but
>> without either committing or rolling back the transaction, but just holding
>> it open? Are you stopping it cleanly, between transactions?
>>
>
> We are repeatedly running delete commands in their own transactions. We
> stop issuing new deletes and let them finish cleanly.
>
>>
>> Also, how many queries are happening concurrently? Perhaps you need a
>> connection pooler.
>>
> Usually between 1 and 20. When it gets locked up closer to 100-200.
> We should add a connection pooler. Would the number of active queries on
> the table be causing the issue?
>

100 to 200 active connections cannot be helpful. That number should not be
*inherently* harmful, but certainly can be very harmful in conjunction with
something else. One thing it could be harmful in conjunction with would be
contention on the PROCLOCK spinlock, but if you don't have open
transactions that have touched a lot of tuples (which it sounds like you do
not) then that probably isn't the case. Another thing could be kernel
scheduler problems. I think some of the early 3-series kernels had some
problems with the scheduler under many concurrently active processes, which
lead to high % system CPU time. There are also problems with NUMA, and
with transparent huge pages, from around the same kernel versions.

>
>> Is the CPU time user time or system time? What kernel version do you
>> have?
>>
> Real time - 3.2.0-26
>

I meant using "top" or "sar" during a lock up, is the CPU time being spent
in %user, or in %system?

Unfortunately I don't know exactly when in the 3-series kernels the
problems showed up, or were fixed.

In any case, lowering the max_connections will probably prevent you from
accidentally poking the beast, even if we can't figure out exactly what
kind of beast it is.

>
>>
>> max_connections | 600
>>> | configuration file
>>>
>>
>> That is quite extreme. If a temporary load spike (like from the deletes
>> and the hinting needed after them) slows down the select queries and you
>> start more and more of them, soon you could tip the system over into kernel
>> scheduler insanity with high system time. Once in this mode, it will stay
>> there until the incoming stream of queries stops and the existing ones
>> clear out. But, if that is what is occurring, I don't know why queries on
>> other tables would still be fast.
>>
> We probably want a connection pooler anyways, but in this particular case,
> the load average is fairly low on the machine running Postrgres.
>

Is the load average low even during the problem event?

Cheers,

Jeff

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Katharina Koobs 2014-02-12 08:58:49 increasing query time after analyze
Previous Message Claudio Freire 2014-02-11 23:36:45 Re: Bloated tables and why is vacuum full the only option