From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Fix lag columns in pg_stat_replication not advancing when replay LSN stalls |
Date: | 2025-10-17 03:56:55 |
Message-ID: | CAHGQGwGdGQ=1-X-71Caee-LREBUXSzyohkoQJd4yZZCMt24C0g@mail.gmail.com |
Lists: | pgsql-hackers |
Hi,
While testing, I noticed that write_lag and flush_lag in pg_stat_replication
initially advanced but eventually stopped updating. This happened when
I started pg_receivewal, ran pgbench, and periodically monitored
pg_stat_replication.
My analysis shows that this issue occurs when any of the write, flush,
or replay LSNs in the standby’s feedback message stop updating for some time.
In the case of pg_receivewal, the replay LSN is always invalid (never updated),
which triggers the problem. Similarly, in regular streaming replication,
if the replay LSN remains unchanged for a long time—such as during
a recovery conflict—the lag values for both write and flush can stop advancing.
The root cause seems to be that when any of the LSNs stop updating,
the lag tracker's cyclic buffer becomes full (the write head reaches
the slowest read head). In this situation, LagTrackerWrite() and
LagTrackerRead() do not handle the full-buffer condition properly.
For instance, if the replay LSN stalls, the buffer fills up and the read heads
for "write" and "flush" end up at the same position as the write head.
This causes LagTrackerRead() to return -1 for both, preventing write_lag
and flush_lag from advancing.
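To make the failure mode concrete, here is a toy model of the situation described above. It is a simplified sketch, not the walsender code: the names (ToyTracker, toy_write, toy_read, toy_simulate_stall) and the 4-slot buffer are hypothetical, lag interpolation is left out, and the full-buffer behavior (overwriting the newest sample in place) is an assumption about how the existing code reacts.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_BUFFER_SIZE 4       /* hypothetical; tiny so it fills quickly */
#define TOY_NUM_HEADS   3
enum { TOY_WRITE = 0, TOY_FLUSH = 1, TOY_REPLAY = 2 };

typedef struct { uint64_t lsn; int64_t time; } ToySample;

typedef struct
{
    ToySample buffer[TOY_BUFFER_SIZE];
    int       write_head;
    int       read_heads[TOY_NUM_HEADS];
} ToyTracker;

/* Record (lsn, time). Assumed behavior: when the next slot is still owned
 * by some read head, overwrite the newest sample instead of advancing. */
static void
toy_write(ToyTracker *t, uint64_t lsn, int64_t time)
{
    int  new_write_head = (t->write_head + 1) % TOY_BUFFER_SIZE;
    bool full = false;

    for (int i = 0; i < TOY_NUM_HEADS; i++)
        if (new_write_head == t->read_heads[i])
            full = true;

    if (full)
    {
        /* Step back one slot; the write head ends up where it started. */
        new_write_head = t->write_head;
        t->write_head = (t->write_head + TOY_BUFFER_SIZE - 1) % TOY_BUFFER_SIZE;
    }

    t->buffer[t->write_head].lsn = lsn;
    t->buffer[t->write_head].time = time;
    t->write_head = new_write_head;
}

/* Consume samples up to lsn for one read head; -1 means "no new sample". */
static int64_t
toy_read(ToyTracker *t, int head, uint64_t lsn, int64_t now)
{
    int64_t time = 0;

    while (t->read_heads[head] != t->write_head &&
           t->buffer[t->read_heads[head]].lsn <= lsn)
    {
        time = t->buffer[t->read_heads[head]].time;
        t->read_heads[head] = (t->read_heads[head] + 1) % TOY_BUFFER_SIZE;
    }

    if (time == 0 || time > now)
        return -1;
    return now - time;
}

/* Write and flush are acknowledged one time unit after each send, while
 * replay is never acknowledged (as with pg_receivewal). Stores the
 * reported write lag for each step in write_lags[0..steps-1]. */
void
toy_simulate_stall(int steps, int64_t *write_lags)
{
    ToyTracker t = {0};

    for (int k = 1; k <= steps; k++)
    {
        uint64_t lsn = (uint64_t) k;
        int64_t  sent = 10 * (int64_t) k;

        toy_write(&t, lsn, sent);
        write_lags[k - 1] = toy_read(&t, TOY_WRITE, lsn, sent + 1);
        (void) toy_read(&t, TOY_FLUSH, lsn, sent + 1);
    }
}
```

In this model the write lag is reported for the first three steps. Once the write head catches up with the stalled replay head, new samples land in a slot the write and flush heads have already passed, so every later read returns -1.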
The attached patch fixes the problem by treating the slowest read entry
(the one causing the buffer to fill up) as a separate overflow entry,
allowing the lag tracker to continue operating correctly.
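One way to read the overflow-entry idea is sketched below. This is an illustrative interpretation under stated assumptions, not the attached patch: the names (OvfTracker, ovf_write, ovf_read, ovf_simulate), the per-head overflowed[] slot, the use of -1 as an "overflowed" marker, and the 4-slot buffer are all hypothetical, and lag interpolation is again left out.

```c
#include <stdint.h>

#define OVF_BUFFER_SIZE 4       /* hypothetical; tiny so it fills quickly */
#define OVF_NUM_HEADS   3
enum { OVF_WRITE = 0, OVF_FLUSH = 1, OVF_REPLAY = 2 };

typedef struct { uint64_t lsn; int64_t time; } OvfSample;

typedef struct
{
    OvfSample buffer[OVF_BUFFER_SIZE];
    int       write_head;
    int       read_heads[OVF_NUM_HEADS];  /* -1: next sample is in overflowed[] */
    OvfSample overflowed[OVF_NUM_HEADS];
} OvfTracker;

/* Record (lsn, time). A read head that would block the write head has its
 * oldest unread sample parked in overflowed[], so the ring keeps moving. */
static void
ovf_write(OvfTracker *t, uint64_t lsn, int64_t time)
{
    int new_write_head = (t->write_head + 1) % OVF_BUFFER_SIZE;

    for (int i = 0; i < OVF_NUM_HEADS; i++)
    {
        if (t->read_heads[i] == new_write_head)
        {
            t->overflowed[i] = t->buffer[t->read_heads[i]];
            t->read_heads[i] = -1;
        }
    }

    t->buffer[t->write_head].lsn = lsn;
    t->buffer[t->write_head].time = time;
    t->write_head = new_write_head;
}

/* Consume samples up to lsn for one read head; -1 means "no new sample". */
static int64_t
ovf_read(OvfTracker *t, int head, uint64_t lsn, int64_t now)
{
    int64_t time = 0;

    if (t->read_heads[head] == -1)
    {
        if (t->overflowed[head].lsn > lsn)
            return -1;          /* parked sample not reached yet */

        /* Consume the parked sample, then resume at the oldest ring entry. */
        time = t->overflowed[head].time;
        t->read_heads[head] = (t->write_head + 1) % OVF_BUFFER_SIZE;
    }

    while (t->read_heads[head] != t->write_head &&
           t->buffer[t->read_heads[head]].lsn <= lsn)
    {
        time = t->buffer[t->read_heads[head]].time;
        t->read_heads[head] = (t->read_heads[head] + 1) % OVF_BUFFER_SIZE;
    }

    if (time == 0 || time > now)
        return -1;
    return now - time;
}

/* Same scenario as a stalled replay LSN: write and flush are acknowledged
 * at every step, replay only once at the very end. Stores the write lag per
 * step in write_lags[] and returns the replay lag from the final read. */
int64_t
ovf_simulate(int steps, int64_t *write_lags)
{
    OvfTracker t = {0};

    for (int k = 1; k <= steps; k++)
    {
        uint64_t lsn = (uint64_t) k;
        int64_t  sent = 10 * (int64_t) k;

        ovf_write(&t, lsn, sent);
        write_lags[k - 1] = ovf_read(&t, OVF_WRITE, lsn, sent + 1);
        (void) ovf_read(&t, OVF_FLUSH, lsn, sent + 1);
    }
    return ovf_read(&t, OVF_REPLAY, (uint64_t) steps, 10 * (int64_t) steps + 1);
}
```

In this sketch the stalled head loses the intermediate samples it never read, which seems an acceptable trade-off given that the alternative is freezing write_lag and flush_lag for every head.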
Thoughts?
Regards,
--
Fujii Masao
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Fix-lag-columns-in-pg_stat_replication-not-advanc.patch | application/octet-stream | 3.1 KB |