Re: Puzzling full database lock

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Christopher Opena <counterveil(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Puzzling full database lock
Date: 2012-02-02 14:03:05
Message-ID: CAHyXU0zmv=un8ewJCKDqLUjPwFJofGM1+UucTYwczPQ0r9x_XQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, Feb 1, 2012 at 6:38 PM, Christopher Opena <counterveil(at)gmail(dot)com> wrote:
> Hello folks,
>
> We've been running into some very strange issues of late with our PostgreSQL
> database(s).  We have an issue where a couple of queries push high CPU on a
> few of our processors and the entire database locks (reads, writes, console
> cannot be achieved unless the high CPU query procs are killed).  Further
> investigation shows ~59% total cpu usage (we have 16 total cores), low io,
> and mid-to-low memory usage (we have 74GB of memory, shared_buffers=16GB).
>  We had previously seen some high io problems but those turned out to be
> unconnected and ultimately solved, yet we are still seeing a complete lock
> of the DB occasionally as previously described.
>
> The queries themselves are not any different than normal usage on other
> databases; they are pulling back a little more data but there's nothing that
> stands out about them as far as query construction.
>
> One thing that we aren't sure of is whether or not we are running into a
> general connection pooling issue.  Our typical number of postgresql
> processes fluctuates between 1,400 and 1,600 - most of which are idle - as
> we have a number of application servers all connecting to a central
> read/write master (the master replicates out to a secondary via streaming
> replication).  We have max_processes set to 3,000 after tweaking some kernel
> memory parameters so at least we know we aren't exceeding that, but is there
> a practical "real world" limit or issue with setting this too high?
>
> Ultimately, the problem we're seeing is a full read/write lock on a system
> that is apparently at medium usage levels once we got rid of our high io red
> herring.  Honestly I'm a little stumped as to where to look next; is there
> some specific metric I might be missing here?
>
> Any help is greatly appreciated,

Random thoughts/suggestions:
*) Look for some correlation between non-idle process count and
locking situation. You are running a lot of processes and if I was in
your shoes I would be strongly looking at pgbouncer to handle
connection pooling. You could be binding in the database or (worse)
the kernel

*) Try logging checkpoints to see if there is any correlation with your locks.

*) An strace of both the 'high cpu' process and one of the blocked
process might give some clues -- in particular if you are being
blocked on a system call

*) Given enough time, do your high cpu queries ever complete? Are they
writing or reading?

*) What happens to overall system load if you lower shared_buffers to, say, 4gb?

merlin

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Florian Weimer 2012-02-02 14:08:51 Re: Is it possible to speed up addition of "not null"?
Previous Message hubert depesz lubaczewski 2012-02-02 13:20:59 Re: Is it possible to speed up addition of "not null"?