walsender vs. XLogBackgroundFlush during shutdown

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: walsender vs. XLogBackgroundFlush during shutdown
Date: 2019-05-01 00:28:45
Message-ID: 20190501002845.uyacu4p36zcubbs2@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I see there's an ongoing discussion about race conditions in walreceiver
blocking shutdown, so let me start a new thread about a race condition in
walsender during shutdown.

The symptoms are rather simple - 'pg_ctl -m fast shutdown' gets stuck,
waiting for walsender processes to catch-up and terminate indefinitely.

The reason for that is pretty simple - the walsenders are doing logical
decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
But per XLogSendLogical() this only happens if

if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
{
WalSndCaughtUp = true;
...
}

That is, we need to get beyong GetFlushRecPtr(). But that may never
happen, because there may be incomplete (and unflushed) page in WAL
buffers, not forced out by any transaction. So if there's a WAL record
overflowing to the unflushed page, the walsender will never catch up.

Now, this situation is apparently expected, because WalSndWaitForWal()
does this:

/*
* If we're shutting down, trigger pending WAL to be written out,
* otherwise we'd possibly end up waiting for WAL that never gets
* written, because walwriter has shut down already.
*/
if (got_STOPPING)
XLogBackgroundFlush();

but unfortunately that does not actually do anything, because right at
the very beginning XLogBackgroundFlush() does this:

/* back off to last completed page boundary */
WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;

That is, it intentionally ignores the incomplete page, which means the
walsender can't read the record and reach GetFlushRecPtr().

XLogBackgroundFlush() does this since (at least) 2007, so how come we
never had issues with this before?

Firstly, walsenders used for physical replication don't have this issue,
because they don't need to decode the WAL record overflowing to the
incomplete/unflushed page. So it seems only walsenders used for logical
decoding are vulnerable to this.

Secondly, this seems to happen only when a bit of WAL is generated just
at the right moment during shutdown. I have initially ran into this issue
with the failover slots (i.e. the patch that was committed and reverted
in the 9.6 cycle, IIRC), which is doing exactly this - it's writing the
slot positions into WAL during shutdown. Which made it pretty trivial to
trigger this issue.

But I don't think we're safe without the failover slots patch, because
any output plugin can do something similar - say, LogLogicalMessage() or
something like that. I'm not aware of a plugin doing such things, but I
don't think it's illegal / prohibited either. (Of course, plugins that
generate WAL won't be useful for decoding on standby in the future.)

So what I think we should do is to tweak XLogBackgroundFlush() so that
during shutdown it skips the back-off to page boundary, and flushes even
the last piece of WAL. There are only two callers anyway, so something
like XLogBackgroundFlush(bool) would be simple enough.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2019-05-01 00:30:23 Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Previous Message Robert Haas 2019-05-01 00:26:03 Re: Initializing LWLock Array from Server Code