wal_sender_timeout should ignore server-side latency

From: Noah Misch <noah(at)leadboat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: wal_sender_timeout should ignore server-side latency
Date: 2018-08-26 03:46:00
Message-ID: 20180826034600.GA1105084@rfd.leadboat.com
Lists: pgsql-hackers

WalSndLoop() does this, simplifying considerably:

    for (;;)
    {
        /* does: last_reply_timestamp = GetCurrentTimestamp() */
        ProcessRepliesIfAny();
        send_data();    /* e.g. XLogSendPhysical(), which calls XLogRead() */
        WalSndCheckTimeOut(GetCurrentTimestamp());
    }

A consequence is that any time spent in the send_data() callback counts
against the timeout. In particular, if a single send_data() takes longer than
wal_sender_timeout, the client is powerless to prevent a timeout. This
disagrees with the wal_sender_timeout documentation ("Terminate replication
connections that are inactive longer than the specified number of
milliseconds. This is useful for the sending server to detect a standby crash
or network outage"). I find it undesirable.

The fix, attached, is to interpret the timeout relative to a timestamp taken
just before ProcessRepliesIfAny() polls the socket. If that timestamp is at
least wal_sender_timeout later than the last reply, we can terminate with
confidence. This adds one gettimeofday() per ProcessRepliesIfAny() call that
finds no replies, which feels cheap enough.
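
To make the difference concrete, here is a minimal standalone sketch of the two
timeout rules. It is not the attached patch and uses no PostgreSQL code:
SimSender, sim_iteration() and the simulated millisecond clock are all
hypothetical names invented for illustration. It assumes a reply, when
present, is stamped with the time the socket was polled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t sim_time_ms;        /* hypothetical simulated clock, in ms */

typedef struct SimSender
{
    sim_time_ms now;                /* simulated current time */
    sim_time_ms last_reply;         /* when the last client reply was seen */
    sim_time_ms last_poll;          /* time taken just before polling the socket */
    sim_time_ms timeout;            /* stands in for wal_sender_timeout */
} SimSender;

/*
 * One pass of a simplified sender loop.  Returns true if the timeout check
 * would terminate the connection.
 *
 * use_poll_time = false: compare the time after send_data() with the last
 *                        reply (the behavior described at the top).
 * use_poll_time = true:  compare the pre-poll timestamp with the last reply
 *                        (the proposed rule, as assumed by this sketch).
 */
static bool
sim_iteration(SimSender *s, bool reply_waiting, sim_time_ms send_cost,
              bool use_poll_time)
{
    sim_time_ms reference;

    assert(send_cost >= 0);

    /* ProcessRepliesIfAny() analogue: note the time, then poll. */
    s->last_poll = s->now;
    if (reply_waiting)
        s->last_reply = s->last_poll;

    /* send_data() analogue: server-side work advances the clock. */
    s->now += send_cost;

    /* WalSndCheckTimeOut() analogue. */
    reference = use_poll_time ? s->last_poll : s->now;
    return reference - s->last_reply >= s->timeout;
}
```

With timeout = 60000 and a single 65000 ms send, a client that replied
immediately before the send is still terminated under the first rule but not
under the second. A client that then stays silent is caught by the second rule
on the next poll, since by then the pre-poll timestamp is 65000 ms past the
last reply.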

We've seen a number of wal_sender_timeout buildfarm failures on systems with
I/O performance trouble:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2018-08-16%2020:55:57
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2018-06-30%2020:38:10
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2018-04-12%2018:12:36
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2018-01-13%2005:01:17
https://postgr.es/m/flat/20170604211229.GA1528911@rfd.leadboat.com

Fixing $SUBJECT won't necessarily cure that, because an I/O stall on the
client side can still cause a failure. We'd need something like threads or
async I/O to avoid that. I mention a less-important corner case in the
WalSndCheckTimeOut() header comment. You can simulate slow XLogSendPhysical()
to explore these problems on any system:

--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -65,2 +65,3 @@
 #include "libpq/pqformat.h"
+#include "libpq/pqsignal.h"
 #include "miscadmin.h"
@@ -2731,2 +2732,5 @@ XLogSendPhysical(void)
 	enlargeStringInfo(&output_message, nbytes);
+	PG_SETMASK(&BlockSig);
+	pg_usleep(65 * 1000 * 1000);
+	PG_SETMASK(&UnBlockSig);
 	XLogRead(&output_message.data[output_message.len], startptr, nbytes);

Attachment Content-Type Size
wal_sender_timeout-server-independent-v1.patch text/plain 6.3 KB