| From: | shveta malik <shveta(dot)malik(at)gmail(dot)com> |
|---|---|
| To: | Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> |
| Cc: | Japin Li <japinli(at)hotmail(dot)com>, surya poondla <suryapoondla4(at)gmail(dot)com>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com> |
| Subject: | Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication |
| Date: | 2026-04-07 11:48:13 |
| Message-ID: | CAJpy0uAFgnAGTmALaPH-3KKDi7XR0C9E__FLSi98H+h59+1UwA@mail.gmail.com |
| Lists: | pgsql-hackers |
On Tue, Apr 7, 2026 at 3:56 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>
> Hi,
>
> On Tue, Apr 7, 2026 at 11:20 AM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
> >
> > Hi,
> >
> > On Tue, Apr 7, 2026 at 9:04 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> > >
> > >
> > > I see your point. I agree that using wal_receiver_status_interval for
> > > this test may not be a reliable way. Can we attempt using
> > > pg_wal_replay_pause() on standby and then checking
> > > wait_event=WaitForStandbyConfirmation with backend_type=walsender on
> > > primary? Or do you see any issues in this approach that I might be
> > > overlooking?
> > >
> >
> > Yes, I think we can make use of the WAL replay pause/resume mechanism.
> > This seems like the right approach, as it gives us a more controlled
> > and deterministic way to validate the lagging behavior.
> >
>
> Looking at 049_wait_for_lsn.pl (the test case you referenced), it
> explicitly stops the WAL receiver by setting primary_conninfo to an
> empty string, rather than just pausing WAL replay.

Oh, I missed that in the testcase. Setting primary_conninfo to an empty
string essentially means the walreceiver is not started, which makes the
standby slot inactive, and we already have a testcase for that.

> Using
> pg_wal_replay_pause() alone only halts replay; the WAL receiver
> continues running, keeps receiving WAL, and sends feedback/status to
> the primary. That feedback is sufficient to advance restart_lsn on the
> standby’s slot, which would violate the restart_lsn < wait_for_lsn
> condition inside StandbySlotsHaveCaughtup(), which is not what we
> want.

Yes, I see. IIUC, the same problem exists if we use
recovery_min_apply_delay, i.e., WAL will still be received and flushed,
and feedback will still be sent to the primary; only replay will be
delayed. We could use 'synchronous_commit = remote_apply' along with
'recovery_min_apply_delay', but then logical replication would be
delayed because the transaction commit is blocked, not because the
standby is actually lagging. That would not be a suitable test for
'synchronized_standby_slots'.
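
To make the reasoning above concrete, here is a toy model in Python. It is not PostgreSQL code: the class and function names are hypothetical, LSNs are plain integers, and it assumes (as described above) that the slot's restart_lsn follows the flush position the standby reports in its feedback, independent of replay progress.

```python
from dataclasses import dataclass


@dataclass
class ToyStandby:
    """Hypothetical stand-in for a standby; LSNs are plain integers."""
    flush_lsn: int = 0
    replay_lsn: int = 0
    replay_paused: bool = False

    def receive(self, primary_lsn: int) -> int:
        # The WAL receiver keeps receiving and flushing WAL even when
        # replay is paused or delayed.
        self.flush_lsn = primary_lsn
        if not self.replay_paused:
            self.replay_lsn = primary_lsn
        # Feedback to the primary carries the flush position.
        return self.flush_lsn


def slot_has_caught_up(restart_lsn: int, wait_for_lsn: int) -> bool:
    # Simplified form of the condition discussed in this thread:
    # the slot is lagging only while restart_lsn < wait_for_lsn.
    return restart_lsn >= wait_for_lsn


standby = ToyStandby(replay_paused=True)
wait_for_lsn = 100
restart_lsn = standby.receive(wait_for_lsn)  # assumed: slot follows feedback
caught_up = slot_has_caught_up(restart_lsn, wait_for_lsn)
```

In this sketch replay_lsn stays at 0 while restart_lsn reaches wait_for_lsn, so the slot is never seen as lagging even though replay is paused.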
>
> This leads to the question: can we construct a realistic test case
> where a failover standby remains active (WAL receiver running) while
> its restart_lsn is still genuinely lagging and consistently so? This
> likely needs further exploration.
>
I have no more ideas here. We can get rid of the lagging testcase. But
let's wait for a day to see if Hou-San has any further ideas on how to
write a deterministic testcase here.
thanks
Shveta