Odd blocking (or massively latent) issue - even with EXPLAIN

From: Jim Vanns <james(dot)vanns(at)framestore(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Odd blocking (or massively latent) issue - even with EXPLAIN
Date: 2012-07-23 08:41:45
Message-ID: 1343032905.12579.3.camel@sys367.ldn.framestore.com
Lists: pgsql-performance

(First, apologies if this post now appears twice - it appears our mail server rewrites my address!)

Hello all. I'm a pgsql performance virgin so I hope I cross all the 't's
and dot the lower-case 'j's when posting this query...

On our production database server we're experiencing behaviour that
several engineers are unable to explain - hence this email. First, our
specs:

Scientific Linux 6.2, kernel 2.6.32
PG version 9.1.3, release 1PGDG.rhel6
24GB RAM
8 cores
2x software SSD-based RAIDs:
a) ~660GB, RAID 5, 4 SSDs (data)
b) ~160GB, RAID 1, 2 SSDs (xlogs + tmp tables)

We're seeing SELECT statements and even EXPLAIN (no ANALYZE)
statements hang indefinitely until *something* (we don't know what)
releases some kind of resource or stops being a massive bottleneck.
Since a plain EXPLAIN only plans a query without executing it, whatever
is stalling us appears to affect even the planning stage. These are the
symptoms.

However, the system seems healthy - no table ('heavyweight') locks are
held by any session (this happens with only a few connected sessions),
all indexes are used correctly, other transactions are writing data (we
generally only have a few sessions running at a time - perhaps 10), and
so on. In fact, we (or the bgwriter) are sometimes writing 200-300 MB/s.
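
For completeness, a rough sketch of the kind of pg_locks check that can
rule out heavyweight locks is below (not our exact query - the columns
are from the 9.1 catalogs and the join is deliberately loose). Checks
along these lines show nothing waiting on a lock:

-- Sketch: pair each ungranted lock request with a granted lock on the
-- same object held by another backend (pg_locks columns as of 9.1).
SELECT w.pid AS waiting_pid,
       w.locktype,
       w.relation::regclass AS relation,
       h.pid AS holding_pid,
       h.mode AS held_mode
FROM pg_locks w
JOIN pg_locks h ON h.locktype = w.locktype
               AND h.database IS NOT DISTINCT FROM w.database
               AND h.relation IS NOT DISTINCT FROM w.relation
               AND h.granted
WHERE NOT w.granted
  AND h.pid <> w.pid;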

We regularly run vacuum analyze at quiet periods - generally once or
twice daily.

These blocked sessions (which only read data) can block for anything
from five minutes to tens of hours, then miraculously complete
successfully all at once.

Any suggestions for my next avenue of investigation? I'll try to
capture more data by observation next time it happens (it is relatively
intermittent).
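
Next time it hangs I plan to snapshot pg_stat_activity with something
along these lines (a rough sketch; note that 9.1 still uses the
procpid/current_query columns, which become pid/query in 9.2):

-- Sketch: show every non-idle backend, whether it is waiting on a
-- heavyweight lock, and how long its current query has been running.
SELECT procpid,
       usename,
       waiting,
       now() - query_start AS query_runtime,
       current_query
FROM pg_stat_activity
WHERE current_query <> '<IDLE>'
ORDER BY query_start;

If the stuck SELECTs show waiting = false, then whatever we're blocked
on isn't a heavyweight lock, which would point towards I/O, lightweight
locks or something else entirely.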

Regards,

Jim

PS. These are the settings that differ from the default:

checkpoint_segments = 128
maintenance_work_mem = 256MB
synchronous_commit = off
random_page_cost = 3.0
wal_buffers = 16MB
shared_buffers = 8192MB
checkpoint_completion_target = 0.9
effective_cache_size = 18432MB
work_mem = 32MB
effective_io_concurrency = 12
max_stack_depth = 8MB
log_autovacuum_min_duration = 0
log_lock_waits = on
autovacuum_vacuum_scale_factor = 0.1
autovacuum_naptime = 8
autovacuum_max_workers = 4

PPS. I've just noticed that our memory configuration is oversubscribed!
shared_buffers + effective_cache_size (8192MB + 18432MB = 26624MB) is
greater than our total available RAM (24GB = 24576MB)! Could this be
the root cause somehow?

--
Jim Vanns
Systems Programmer
Framestore
