A WalSnd issue related to state WALSNDSTATE_STOPPING

From: Paul Guo <pguo(at)pivotal(dot)io>
To: pgsql-hackers(at)postgresql(dot)org
Subject: A WalSnd issue related to state WALSNDSTATE_STOPPING
Date: 2018-11-21 04:54:30
Message-ID: CAEET0ZEv8VFqT3C-cQm6byOB4r4VYWcef1J21dOX-gcVhCSpmA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hello,

Recently I encountered a panic (assert failure) on a cassert enabled build
of Greenplum database (an MPP database based on postgres). The based
postgresql version is 9.4 stable (9.4.19). The panic is not complex. Please
see below for more details. This issue seems to be a bug in postgresql also
so I'm writing to community.

(gdb) bt

#0 0x00007fb8c56321f7 in __GI_raise (sig=sig(at)entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fb8c56338e8 in __GI_abort () at abort.c:90

#2 0x0000000000ae3fd2 in ExceptionalCondition (conditionName=0xdcec75
"!(syncWalSnd)",
errorType=0xdce77a "FailedAssertion", fileName=0xdce770 "syncrep.c",
lineNumber=505) at assert.c:66
#3 0x0000000000924dbc in SyncRepReleaseWaiters () at syncrep.c:505

#4 0x0000000000915e5e in ProcessStandbyReplyMessage () at
walsender.c:1667
#5 0x0000000000915bea in ProcessStandbyMessage () at walsender.c:1566

#6 0x0000000000915ab5 in ProcessRepliesIfAny () at walsender.c:1502

#7 0x0000000000916303 in WalSndLoop (send_data=0x916d46
<XLogSendPhysical>) at walsender.c:1923
#8 0x000000000091439b in StartReplication (cmd=0x15abea8) at
walsender.c:704
#9 0x000000000091586e in exec_replication_command (

cmd_string=0x161fa78 "START_REPLICATION 1/84000000 TIMELINE 6") at
walsender.c:1416
#10 0x0000000000979915 in PostgresMain (argc=1, argv=0x15bafd8,
dbname=0x15baec8 "", username=0x15baea8 "gpadmin")
at postgres.c:5296

(gdb) p max_wal_senders

$1 = 1

(gdb) p WalSndCtl->walsnds[0]

$2 = {

pid = 2765,

state = WALSNDSTATE_STOPPING,

sentPtr = 6509560784,

sendKeepalive = 0 '\000',

needreload = 0 '\000',

write = 6509560784,

flush = 6509560664,

apply = 6509560664,

caughtup_within_range = 1 '\001',

xlogCleanUpTo = 6509560664,

mutex = 0 '\000',

latch = {

is_set = 1,

is_shared = 1 '\001',

owner_pid = 2765

},

sync_standby_priority = 1,

synchronous = 0 '\000',

replica_disconnected_at = 0

}

Below is related code.

for (i = 0; i < max_wal_senders; i++)
{
/* use volatile pointer to prevent code rearrangement */
volatile WalSnd *walsnd = &walsndctl->walsnds[i];

if (walsnd->pid != 0 &&
walsnd->state >= WALSNDSTATE_STREAMING &&
walsnd->sync_standby_priority > 0 &&
(priority == 0 ||
priority > walsnd->sync_standby_priority) &&
!XLogRecPtrIsInvalid(walsnd->flush))
{
priority = walsnd->sync_standby_priority;
syncWalSnd = walsnd;
}
}

/*
* We should have found ourselves at least.
*/
Assert(syncWalSnd);

The panic reason is that since there is just one wal sender and
WalSndCtl->walsnds[0].state is WALSNDSTATE_STOPPING so syncWalSnd will be
NULL and that causes assert failure. Latest postgresql code removes the
Assert code but the related similar code logic was kept. It looks like that
we need to modify the logic similar like below (PG 9.4 STABLE) to allow
WALSNDSTATE_STOPPING which is reasonable here If I understand correctly.

if (walsnd->pid != 0 &&
- walsnd->state == WALSNDSTATE_STREAMING &&
+ walsnd->state >= WALSNDSTATE_STREAMING &&
walsnd->sync_standby_priority > 0 &&
(priority == 0 ||
priority > walsnd->sync_standby_priority) &&

For Latest master, the logic is as below but it could be modified
similarly.
/* Must be streaming */
if (state != WALSNDSTATE_STREAMING)
continue;

If yes this change should be on all branches that have the patch which
introduces WALSNDSTATE_STOPPING.

Thanks.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hironobu SUZUKI 2018-11-21 05:03:52 Re: 64-bit hash function for hstore and citext data type
Previous Message Thomas Munro 2018-11-21 04:51:17 Re: Wonky rename semantics on Windows