Re: pg_stat_replication.*_lag sometimes shows NULL during active replication

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Shinya Kato <shinya11(dot)kato(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
Date: 2026-03-12 15:27:32
Message-ID: CAHGQGwGPg20Rw2PydBDXiKHgn55s--E14-qRy7t3M+DHreCJww@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 11, 2026 at 11:39 AM Shinya Kato <shinya11(dot)kato(at)gmail(dot)com> wrote:
>
> On Tue, Mar 10, 2026 at 10:54 AM Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> > Even with your latest patch, if we remove fullyAppliedLastTime, and set
> > clearLagTimes to true when applyPtr == sentPtr && noLagSamples &&
> > positionsUnchanged,
> > wouldn't the time for the lag to become NULL be almost the same as
> > wal_receiver_status_interval?
> >
> > The documentation doesn't clearly specify how long it should take for
> > the lag to become NULL, so doubling that time might be acceptable.
> > However, if we can keep it roughly the same without much complexity,
> > I think that would be preferable.
> >
> > Thought?
>
> Thank you for the suggestion. I tested this by removing
> fullyAppliedLastTime, but even with synchronous replication, NULL
> still appears. Here is why:
>
> - Reply 1 (flush notification): positions = X. Lag samples are
> consumed with real values, so noLagSamples = false. clearLagTimes is
> not set, and prevPtrs = X is saved.
>
> - Reply 2 (force_reply): positions = X again. Here, noLagSamples =
> true and positionsUnchanged = true. Since applyPtr == sentPtr,
> clearLagTimes is set to true, resulting in a NULL value.
>
> Therefore, I believe fullyAppliedLastTime is still necessary to ensure
> that the previous reply also contained no lag samples.

Thanks for testing and for the clarification! You're right.
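
To make the two-reply sequence above concrete, here is a minimal standalone sketch of the decision on the walsender side. It is not the patch code: the ReplyState struct, process_reply() and the useFullyApplied switch are hypothetical names made up for illustration, and the exact point at which the real patch updates fullyAppliedLastTime may differ. Only applyPtr, sentPtr, noLagSamples, positionsUnchanged, clearLagTimes and fullyAppliedLastTime come from the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's WAL position type */

/* Hypothetical per-walsender state, for illustration only. */
typedef struct ReplyState
{
	XLogRecPtr	prevWrite;
	XLogRecPtr	prevFlush;
	XLogRecPtr	prevApply;
	bool		fullyAppliedLastTime;
} ReplyState;

/*
 * Process one standby reply and return true if the lag columns would be
 * reset to NULL (clearLagTimes).
 *
 * useFullyApplied = false: clear as soon as one reply is "quiet".
 * useFullyApplied = true:  additionally require that the previous reply
 *                          was quiet as well.
 */
static bool
process_reply(ReplyState *st, XLogRecPtr sentPtr,
			  XLogRecPtr writePtr, XLogRecPtr flushPtr, XLogRecPtr applyPtr,
			  bool noLagSamples, bool useFullyApplied)
{
	bool		positionsUnchanged;
	bool		quiet;
	bool		clearLagTimes;

	assert(st != NULL);

	positionsUnchanged = (writePtr == st->prevWrite &&
						  flushPtr == st->prevFlush &&
						  applyPtr == st->prevApply);

	/* Standby has applied everything and this reply carried no samples. */
	quiet = (applyPtr == sentPtr && noLagSamples && positionsUnchanged);

	if (useFullyApplied)
		clearLagTimes = quiet && st->fullyAppliedLastTime;
	else
		clearLagTimes = quiet;

	/* Remember this reply for the next call. */
	st->fullyAppliedLastTime = quiet;
	st->prevWrite = writePtr;
	st->prevFlush = flushPtr;
	st->prevApply = applyPtr;

	return clearLagTimes;
}
```

Feeding it Reply 1 (positions = X, noLagSamples = false) and then Reply 2 (positions = X, noLagSamples = true) returns true on Reply 2 when useFullyApplied is false, which is the unwanted NULL. With useFullyApplied set, Reply 2 returns false and only a third identical reply clears the lag, which is where the doubled delay comes from.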

However, if we apply this change, the time required for the lag information to
be reset would effectively double. I'm starting to wonder whether that's really
acceptable, especially for the back branches. Although the docs don't clearly
specify this timing, doubling it could affect systems that monitor
replication lag, for example. It might still be reasonable to apply
such a change in master, though.

On further thought, the root cause seems to be that walreceiver can send
two consecutive status reply messages with identical WAL locations even
when wal_receiver_status_interval has not yet elapsed. Addressing that
behavior directly might resolve the issue you reported. I've attached a PoC
patch that does this. Thoughts?
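
As a rough illustration of that idea only (this is not the attached patch), the check on the walreceiver side could look something like the sketch below. The LastReply struct, should_send_reply() and record_reply() are hypothetical names, timestamps are plain milliseconds rather than TimestampTz, and the sketch assumes wal_receiver_status_interval is greater than zero. It also does not model a reply that the primary explicitly requested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's WAL position type */

/* Hypothetical record of the last status reply that was sent. */
typedef struct LastReply
{
	bool		valid;			/* has any reply been sent yet? */
	XLogRecPtr	write;
	XLogRecPtr	flush;
	XLogRecPtr	apply;
	int64_t		sentAtMs;		/* time of the last reply, in milliseconds */
} LastReply;

/*
 * Decide whether a status reply should be sent now.  A reply whose
 * positions are identical to the previous one is suppressed until the
 * status interval has elapsed.
 */
static bool
should_send_reply(const LastReply *last,
				  XLogRecPtr write, XLogRecPtr flush, XLogRecPtr apply,
				  int64_t nowMs, int64_t statusIntervalMs)
{
	assert(last != NULL);

	/* Nothing sent yet: always report. */
	if (!last->valid)
		return true;

	/* Any position moved: report the progress right away. */
	if (write != last->write || flush != last->flush || apply != last->apply)
		return true;

	/* Identical positions: only repeat them once the interval has passed. */
	return (nowMs - last->sentAtMs) >= statusIntervalMs;
}

/* Remember what was just sent. */
static void
record_reply(LastReply *last,
			 XLogRecPtr write, XLogRecPtr flush, XLogRecPtr apply,
			 int64_t nowMs)
{
	assert(last != NULL);

	last->valid = true;
	last->write = write;
	last->flush = flush;
	last->apply = apply;
	last->sentAtMs = nowMs;
}
```

With a check of this shape, the second reply in the sequence Shinya described (same positions = X, sent right after the flush notification) would simply not go out, so the walsender would never see a sample-less duplicate before the interval expires.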

Regards,

--
Fujii Masao

Attachment: v4-0001-Avoid-sending-duplicate-WAL-locations-in-standby-.patch (application/octet-stream, 9.1 KB)