[BUG] False indication in pg_stat_replication.sync_state

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [BUG] False indication in pg_stat_replication.sync_state
Date: 2012-10-18 08:42:08
Message-ID: 20121018.174208.239638814.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello. My colleague found that pg_stat_replication.sync_state
shows false state for some condition.

This prevents Pacemaker from completing fail-over that could
safely be done.

The point is in walsender.c, pg_stat_get_wal_senders() below, (as
of REL9_2_1)

1555: if (walsnd->pid != 0)
1556: {
1557: /*
1558: * Treat a standby such as a pg_basebackup background process
1559: * which always returns an invalid flush location, as an
1560: * asynchronous standby.
1561: */
! 1562: sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
1563: 0 : walsnd->sync_standby_priority;

Here, XLogRecPtrIsInvalid(walsnd->flush) is defined as
(walsnd->flush.xrecoff == 0) which becomes true as usual at every
WAL 'file's (not segments) boundary. xrecoff == 0 is certainly
invalid for the start point of WAL *RECORD*, but should be
considered valid in replication stream. This check was introduced
at 9.2.0 and the version up between 9.1.4 and 9.1.5.

| DEBUG: write 4/0 flush 3/FEFFEC68 apply 3/FEFFEC68
| DEBUG: write 4/0 flush 4/0 apply 3/FEFFEC68
| DEBUG: HOGE: flush = 3/FEFFEC68 sync_priority[0] = 1
| DEBUG: write 4/111C0 flush 4/0 apply 3/FEFFECC0
!| DEBUG: HOGE: flush = 4/0 sync_priority[0] = 0

This value zero of sync_priority[0] makes sync_status 'async'
errorneously and confuses Pacemaker.

# The log line marked with 'HOGE' above printed by applying the
# patch at the bottom of this message and invoking 'select
# sync_state from pg_stat_replication' periodically. To increase
# the chance to see the symptom, sleep 1 second for 'file'
# boundaries :-)

The Heikki's recent(?) commit
0ab9d1c4b31622e9176472b4276f3e9831e3d6ba which changes the format
of XLogRecPtr from logid:xrecoff struct to 64 bit linear address
would fix the false indication. But I suppose this patch won't be
applied to existing 9.1.x and 9.2.x because of the modification
onto streaming protocol.

As far as I see the patch, it would'nt change the meaning of
XLogRecPtr to change XLogRecPtrIsInvalid from (xrecoff == 0) to
(xrecoff == 0 && xlogid == 0). But this change affects rather
wide portion where handling WAL nevertheless what is needed here
is only to stop the false indication.

On the other hand, pg_basebackup seems return 0/0 as flush and
apply positions so it seems enough only to add xlogid == 0 into
the condition. The patch attached for REL9_2_1 does this and
yields the result following.

| DEBUG: write 2/FEFFFD48 flush 2/FEFFFD48 apply 2/FEFF7AB0
| DEBUG: write 3/0 flush 2/FEFFFD48 apply 2/FEFF7E88
| DEBUG: write 3/0 flush 3/0 apply 2/FEFFFD48
| DEBUG: HOGE: flush = 2/FEFFFD48 sync_priority[0] = 1
| DEBUG: write 3/E338 flush 3/0 apply 2/FEFFFF80
!| DEBUG: HOGE: flush = 3/0 sync_priority[0] = 1

I think this patch should be applied for 9.2.2 and 9.1.7.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

============================================================
===== The patch for this test.
============================================================
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 064ddd5..19f79d1 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -618,6 +618,10 @@ ProcessStandbyReplyMessage(void)
reply.flush.xlogid, reply.flush.xrecoff,
reply.apply.xlogid, reply.apply.xrecoff);

+ if (reply.write.xrecoff == 0 ||
+ reply.flush.xrecoff == 0)
+ sleep(1);
+
/*
* Update shared state for this WalSender process based on reply data from
* standby.
@@ -1561,7 +1565,10 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
*/
sync_priority[i] = XLogRecPtrIsInvalid(walsnd->flush) ?
0 : walsnd->sync_standby_priority;
-
+ elog(DEBUG1, "HOGE: flush = %X/%X sync_priority[%d] = %d",
+ walsnd->flush.xlogid, walsnd->flush.xrecoff,
+ i, sync_priority[i]);
+
if (walsnd->state == WALSNDSTATE_STREAMING &&
walsnd->sync_standby_priority > 0 &&
(priority == 0 ||

Attachment Content-Type Size
20121018_sync_state_patch_v1.patch text/x-patch 831 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2012-10-18 09:20:23 Re: Bugs in CREATE/DROP INDEX CONCURRENTLY
Previous Message Sebastien FLAESCH 2012-10-18 08:15:22 Database object names and libpq in UTF-8 locale on Windows