Quick Links

Re: Better way to monitor for failed replication?

From:	Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
To:	Pgsql-admin <pgsql-admin(at)lists(dot)postgresql(dot)org>
Subject:	Re: Better way to monitor for failed replication?
Date:	2026-01-09 17:53:51
Message-ID:	CANzqJaBwJJj_-hoN1ZA21HLQi4KDdBYvO6FErCK7o61_MPmReg@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-admin

On Fri, Jan 9, 2026 at 12:42 PM Keith <keith(at)keithf4(dot)com> wrote:

>
>
> On Fri, Jan 9, 2026 at 10:50 AM Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
> wrote:
>
>> Currently, in a bash script, I run this SELECT statement against the
>> Primary server which is supposed to replicate to multiple servers. If
>> active == f, I send an alter email.
>>
>> postgres=# SELECT rs.slot_name, rs.active, sr.client_hostname
>> from pg_replication_slots rs
>> left outer join pg_stat_replication sr on rs.active_pid = sr.pid;
>> slot_name | active | client_hostname
>> --------------+--------+-----------------
>> pgstandby1 | t | BBOPITCPGS302B
>> replicate_dr | f |
>> (2 rows)
>>
>> Is there a better way to check for replication that's supposed to be
>> happening, but isn't (like PG on the replica was stopped for some reason)?
>>
>

>
> Your example only takes into account if you are using replication slots,
> correct? If you're always using those, this is definitely a good metric to
> have since the slot going down means WAL buildup, so I'd definitely keep it.
>

Yes, just replication slots.

> As for general replication monitoring, these have been the two queries I
> use
>
> On the Primary:
>
> SELECT client_addr AS replica
> , client_hostname AS replica_hostname
> , client_port AS replica_port
> , pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes
> FROM pg_catalog.pg_stat_replication;
>
> This checks for byte-lag for all active streaming replicas, physical or
> logical. A count of zero or NULL from this metric means all replicas are
> down. Can monitor a specific count if you have a known number of replicas.
>
> On any Replica:
>
> SELECT
> CASE
> WHEN (pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()) OR
> (pg_is_in_recovery() = false) THEN 0
> ELSE EXTRACT (EPOCH FROM clock_timestamp() -
> pg_last_xact_replay_timestamp())::INTEGER
> END
> AS replay_time
> , CASE
> WHEN pg_is_in_recovery() = false THEN 0
> ELSE EXTRACT (EPOCH FROM clock_timestamp() -
> pg_last_xact_replay_timestamp())::INTEGER
> END
> AS received_time;
>
> This monitors the lag in seconds from the replica. Technically it monitors
> the last time a WAL file was received (received_time) and the last time WAL
> was actually replayed (replay_time). The reason for both is that the
> received time can be a false positive when there is no write activity on
> the primary. If there's always supposed to be write activity, this can be a
> another good metric to indicate that something is very wrong. The
> replay_time metric avoids the false positive by only being considered when
> receive is different than replay.
>

I'll integrate this into the lag report.

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

In response to

Re: Better way to monitor for failed replication? at 2026-01-09 17:41:57 from Keith

Browse pgsql-admin by date

	From	Date	Subject
Next Message	Gabriel Guillem Barceló Soteras	2026-01-13 08:19:02	VACUUM/ANALYZE Strategy for Low-Activity PostgreSQL 15 Instance
Previous Message	Keith	2026-01-09 17:41:57	Re: Better way to monitor for failed replication?