Re: standby apply lag on inactive servers

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: standby apply lag on inactive servers
Date: 2020-01-27 20:34:19
Message-ID: 20200127203419.GA15216@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2020-Jan-10, Alvaro Herrera wrote:

> A customer of ours complained that if you have an inactive primary,
> monitoring the apply lag on a standby reports monotonically increasing
> lag. The reason for this is that the apply lag is only updated on
> COMMIT records, which of course don't occur in inactive servers.
> But CHECKPOINT records do occur, so the WAL insert pointer continues to
> move forward, which is what causes the spurious lag.
>
> (I think newer releases are protected from this problem because they
> don't emit checkpoints during periods of inactivity. I didn't verify
> this.)
>
> This patch fixes the problem by using the checkpoint timestamp to update
> the lag tracker in the standby. This requires a little change in where
> this update is invoked, because previously it was done only for the XACT
> rmgr; this makes the patch a little bigger than it should.

Here's a version of the patch that applies to current master. It does
fix the problem that CHECKPOINT wal records are not considered when
determining time-of-latest-record.

However, it does *not* fix the monitoring problem I mentioned (which
relied on comparing pg_last_xact_replay_timestamp() to 'now()') ...
because commit 6ef2eba3f57f (pg10) made an idle server not emit
checkpoint records anymore. That is, my parenthical remark was
completely wrong: the new versions not only are "protected", but also
this fix doesn't fix them. Luckily, the way to fix monitoring for
servers of versions 10 and later is to use the new replay_lag (etc)
columns in pg_stat_replication, commit 6912acc04f0b (also pg10).

I am inclined to apply this to all branches unless there are strong
objections, because the current code seems pretty arbitrary anyway.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
v2-0001-Use-CheckPoint-time-to-update-latest-recovery-tim.patch text/x-diff 3.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Zhang 2020-01-27 20:41:00 Re: Making psql error out on output failures
Previous Message Mark Dilger 2020-01-27 20:05:44 Re: making the backend's json parser work in frontend code