Re: Safe way to check for PostgreSQL replication delay/lag

From: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
To: Strahinja Kustudić <strahinjak(at)nordeus(dot)com>, "pgsql-admin(at)postgresql(dot)org" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: Safe way to check for PostgreSQL replication delay/lag
Date: 2014-03-17 16:39:45
Message-ID: 53272551.6050205@pinpointresearch.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

On 03/15/2014 12:38 PM, Strahinja Kustudić wrote:
> Looking at the documentation and all the blog posts about how to
> monitor replication delay I don't think there is one good and most
> importantly safe solution which works all the time.

I have a basic homegrown script, the core of which is below, that runs
this query on my servers (running 9.1 - settings may need tweaking for
other versions). It is a bash script that runs every minute via cron and
sets $standby_delay to a value in seconds. The script sends varying
levels of alert depending on the severity of the delay. (We must be
sized appropriately since we essentially never get alerts unless we have
restarted a server for maintenance.)

The script uses a one-record/one-column table called sync_status to hold
a timestamptz. To keep things standard across machines, it determines
whether it is a master or replica based on the value of
'transaction_read_only' in pg_settings and either updates sync_status or
reads it, along with other xact_replay information, to determine the
worst-case delay.

The core of the script writes a query to a temporary file then executes
that query. The output will be the worst-case measurement of delay
seconds (or zero if the master) which is saved in $standby_delay. An
empty value indicates a failure of the script to run. Obviously there is
more to the script both in setting the value of creating and setting
$tempquery, cleaning up temp files, and generating the appropriate
alerts based on the severity of the delay.

Your alert methods and severity levels are up to you. Note that when run
by cron every minute there exists a reasonable possibility that the
slave can report a 60-second delay when actually caught up so you will
need to account for that in setting your alert levels.

standby_delay=$(
psql -q --tuples-only --no-align 2>/dev/null <<EOS
\o ${tempquery}
select
case when setting='on' then
'
with logdelay as
(
select
case when
pg_last_xlog_receive_location()=pg_last_xlog_replay_location() then 0::int
else
(extract(epoch from now())-extract(epoch from
pg_last_xact_replay_timestamp()))::int
end as replicadelay
union
select
(extract(epoch from now())-extract(epoch from sync_time))::int
as replicadelay
from
sync_status
)
select
max(replicadelay)
from
logdelay
;
'
else
'
begin;
delete from sync_status;
insert into sync_status (sync_time) values (now()) returning 0::int as
replicadelay;
commit;
'
end
from pg_settings where name='transaction_read_only';
\o
\i ${tempquery}
EOS
)

Cheers,
Steve

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Kasia Tuszynska 2014-03-17 21:37:08 looking for doc
Previous Message digoal@126.com 2014-03-17 05:06:03 Can we split INSERT, UPDATE, DELETE config from mod@log_statement ?