Re: Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Large shared_buffer stalls WAS: proposal: Set effective_cache_size to greater of .conf value, shared_buffers
Date: 2013-09-13 20:58:54
Message-ID: CAHyXU0yQF-nthU_pEUDig+jtjOBNg+QegNoBHyes0n_xQ2r3bA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Sep 13, 2013 at 3:20 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 09/13/2013 12:55 PM, Merlin Moncure wrote:
>> what are the specific symptoms of your problem? anything interesting
>> in pg_locks? is $client willing to experiment with custom patches?
>
> 3 servers: 1 master, two replicas.
> 32-core Xeon, hyperthreaded to 64 cores
> 512GB RAM each
> s_b set to 8GB
> Load-balanced between all 3
> ~~ 11 different databases
> combined database size around 600GB
> using pgbouncer
>
> Irregularly, during periods of high activity (although not necessarily
> peak activity) one or another of the systems will go into paralysis,
> with all backends apparently waiting on LWLocks (we need more tracing
> information to completely confirm this). Activity at this time is
> usually somewhere between 50 and 100 concurrent queries (and 80 to 150
> connections). pg_locks doesn't think anything is waiting on a lock.
>
> What's notable is that sometimes it's just *one* of the replicas which
> goes into paralysis. If the master gets this issue though, the replicas
> experience it soon afterwards. Increasing wal_buffers from 16GB to 64GB
> seems to make this issue happen less frequently, but it doesn't go away
> entirely. Only a restart of the server, or killing all backend, ends
> the lockup.
>
> The workload is OLTP, essentially, around 20/80 write/read. They use
> PostGIS. The other notable thing about their workload is that due to an
> ORM defect, they get idle-in-transactions which last from 5 to 15
> seconds several times a minute.
>
> They are willing to use experimental patches, but only if those patches
> can be applied only to a replica.

ok, points similar:
*) master/slave config (two slaves for me)
*) 'big' server 256GB mem, 32 core
*) 80% approx. (perhaps more)
*) some spacial searching (but not very much)
*) OLTP
*) presentation of load, although in my case it did resolve anywhere
from 30 secs to half hour
*) aside from the spike, 100% healthy

points different
*) application side pooling: 96 app servers, max 5 connections each
(aside: are you using transaction mode pgbouncer?)
*) I saw gripes about relation extension in pg_locks

merlin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2013-09-13 21:04:55 Re: proposal: Set effective_cache_size to greater of .conf value, shared_buffers
Previous Message Dimitri Fontaine 2013-09-13 20:40:54 Completing PL support for Event Triggers