Undetected deadlock with pg_dump on hot_standby server

From: Oliver Seemann <oseemann(at)gmail(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Undetected deadlock with pg_dump on hot_standby server
Date: 2016-03-10 12:50:28
Message-ID: CANCipfoOA=P0zoykZY-XCLVvpKONm+aNVCOmSY96SE6SyhwecA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi!

We observe the following unexpected behaviour:
We run pg_dump on the hot_standby slave every two hours and once in 2-3 weeks it
just hangs forever and the slave stops replaying logs.

Version: 9.3.11
OS: Debian GNU/Linux 7.9 (wheezy)
GUC hot_standby_feedback is off.

(See also attached pg_stat_activity_and_pg_locks.txt)
Looking at pg_stat_activity on the slave we see pg_dump trying to acquire a
lock, with state=active, waiting=t.

Looking at pg_locks we see an entry for pg_dump with granted=f on relation
31353 and lots of other entries for the pg_dump pid (570) on other relations
with granted=t. An AccessExclusiveLock on that relation 31353 is held by
another pid (2033) with virtualtransaction 1/0.

Pid 2033 is the postmaster process. strace says it's waiting in a loop:
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)

GDB Backtrace:

#0 0x00007f99d355edd3 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f99d57a6c4e in pg_usleep ()
#2 0x00007f99d56836a0 in ?? ()
#3 0x00007f99d5683b43 in StandbyAcquireAccessExclusiveLock ()
#4 0x00007f99d5683f6c in standby_redo ()
#5 0x00007f99d54ed01d in StartupXLOG ()
#6 0x00007f99d565454f in StartupProcessMain ()
#7 0x00007f99d54f8d0d in AuxiliaryProcessMain ()
#8 0x00007f99d56512f3 in ?? ()
#9 0x00007f99d56537f7 in PostmasterMain ()
#10 0x00007f99d54886d0 in main ()

Looking at the disassembly and the registers it appears that
StandbyAcquireAccessExclusiveLock() is called with arguments dbOid=30850,
relOid=31262. Looking again at pg_locks, a lock on 31262 is already held by
pg_dump!

So it looks like a deadlock between pg_dump and standby_redo() in the
postmaster that is not detected, making both wait indefinitely. We
would expect the pg_dump session to die when a conflict is
encountered, but the fact that it and the postmaster just hang forever
makes it look like a bug.

We do not call pg_xlog_replay_pause()/resume() around the pg_dump call, we'll
try that now as a workaround.

Oliver

Attachment Content-Type Size
pg_stat_activity_and_pg_locks.txt text/plain 13.1 KB
gdb.txt text/plain 1.4 KB

Browse pgsql-bugs by date

  From Date Subject
Next Message Nick Cleaton 2016-03-10 13:13:37 Re: streaming replication master can fail to shut down
Previous Message tigreavecdesailes 2016-03-10 12:22:49 BUG #14014: postgresql95-setup script determines PGDATA wrongly