Intermittent hangs with 9.2

From: David Whittaker <dave(at)iradix(dot)com>
To: PostgreSQL Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Intermittent hangs with 9.2
Date: 2013-09-10 15:04:21
Message-ID: CABXnLXS0ez1PW6JQbPdd9rs1GnNuwjyn8Zq48XRdZpx8cdRcLQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi All,

I've been seeing a strange issue with our Postgres install for about a year
now, and I was hoping someone might be able to help point me at the cause.
At what seem like fairly random intervals Postgres will become unresponsive
to the 3 application nodes it services. These periods tend to last for 10 -
15 minutes before everything rights itself and the system goes back to
normal.

During these periods the server will report a spike in the outbound
bandwidth (from about 1mbs to about 5mbs most recently), a huge spike in
context switches / interrupts (normal peaks are around 2k/8k respectively,
and during these periods they‘ve gone to 15k/22k), and a load average of
100+. CPU usage stays relatively low, but it’s all system time reported,
user time goes to zero. It doesn‘t seem to be disk related since we’re
running with a shared_buffers setting of 24G, which will fit just about our
entire database into memory, and the IO transactions reported by the
server, as well as the disk reads reported by Postgres stay consistently
low.

We‘ve recently started tracking how long statements take to execute, and
we’re seeing some really odd numbers. A simple delete by primary key, for
example, from a table that contains about 280,000 rows, reportedly took
18h59m46.900s. An update by primary key in that same table was reported as
7d 17h 58m 30.415s. That table is frequently accessed, but obviously those
numbers don't seem reasonable at all.

Some other changes we've made to postgresql.conf:

synchronous_commit = off

maintenance_work_mem = 1GB
wal_level = hot_standby
wal_buffers = 16MB

max_wal_senders = 10

wal_keep_segments = 5000

checkpoint_segments = 128

checkpoint_timeout = 30min

checkpoint_completion_target = 0.9

max_connections = 500

The server is a Dell Poweredge R900 with 4 Xeon E7430 processors, 48GB of
RAM, running Cent OS 6.3.

So far we‘ve tried disabling Transparent Huge Pages after I found a number
of resources online that indicated similar interrupt/context switch issues,
but it hasn’t resolve the problem. I managed to catch it happening once and
run a perf which showed:

+ 41.40% 48154 postmaster 0x347ba9 f 0x347ba9
+ 9.55% 10956 postmaster 0x2dc820 f
set_config_option
+ 8.64% 9946 postmaster 0x5a3d4 f writeListPage
+ 5.75% 6609 postmaster 0x5a2b0 f
ginHeapTupleFastCollect
+ 2.68% 3084 postmaster 0x192483 f
build_implied_join_equality
+ 2.61% 2990 postmaster 0x187a55 f
build_paths_for_OR
+ 1.86% 2131 postmaster 0x794aa f
get_collation_oid
+ 1.56% 1822 postmaster 0x5a67e f
ginHeapTupleFastInsert
+ 1.53% 1766 postmaster 0x1929bc f
distribute_qual_to_rels
+ 1.33% 1558 postmaster 0x249671 f cmp_numerics

I‘m not sure what 0x347ba9 represents, or why it’s an address rather than a
method name.

That's about the sum of it. Any help would be greatly appreciated and if
you want any more information about our setup, please feel free to ask.

Thanks,
Dave

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Andrew Dunstan 2013-09-10 15:26:30 Re: Intermittent hangs with 9.2
Previous Message Andrew Dunstan 2013-09-10 13:41:48 Re: Performance bug in prepared statement binding in 9.2?