Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Cc: Japin Li <japinli(at)hotmail(dot)com>, surya poondla <suryapoondla4(at)gmail(dot)com>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication
Date: 2026-04-07 03:34:26
Message-ID: CAJpy0uD_BYECgS2OQ6h4UxZebhcKaWpGb574WCsO_5yYh9moxw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 6, 2026 at 6:37 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>
> On Fri, Apr 3, 2026 at 2:21 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > On Fri, Apr 3, 2026 at 9:46 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Apr 2, 2026 at 3:55 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
> > > >
> > > > Hi Shveta,
> > > >
> > > > On Wed, Apr 1, 2026 at 12:06 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Thu, Mar 26, 2026 at 5:23 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
> > > > > >
> > > > > >
> > > > > > PFA patch addressing all the comments above and let me know for any
> > > > > > further comments.
> > > > > >
> > > > >
> > > > > Thank You Ashutosh. Doc looks good to me. Few comments:
> > > > >
> > > > > 3)
> > > > > What is the execution time for this new test?
> > > > > I ran it on my VM (which is slightly on the slower side), and the
> > > > > runtime varies between ~60 seconds and ~140 seconds. I executed it
> > > > > around 10–15 times. Most runs completed in about 65 seconds (which is
> > > > > still more), but a few were significantly longer (100+ seconds).
> > > > > During the longer runs, I noticed the following entry in pub.log
> > > > > (possibly related to Test Scenario E taking more time?). Could you
> > > > > please try running this on your end as well?
> > > > >
> > > > > 2026-03-31 19:45:45.557 IST client backend[145705]
> > > > > 053_synchronized_standby_slots_quorum.pl LOG: statement: SELECT
> > > > > active_pid IS NOT NULL
> > > > > AND restart_lsn IS NOT NULL
> > > > > AND restart_lsn < '0/03000450'::pg_lsn
> > > > > FROM pg_replication_slots
> > > > > WHERE slot_name = 'sb1_slot';
> > > > >
> > > > > Just for reference, the complete failover test
> > > > > (t/040_standby_failover_slots_sync.pl) takes somewhere between 7 to
> > > > > 10sec on my VM.
> > > > >
> > > >
> > > > My concern with this new test is that it's both slow to run and prone
> > > > to flakiness, which makes me question whether it's worth keeping.
> > > >
> > >
> > > will review and share my thoughts.
> > >
> >
> > I gave it more thought, another idea for a shorter and quicker
> > testcase could be to check wait_event for that particular
> > application_name in pg_stat_activity. A lagging standby will result in
> > wait_event=WaitForStandbyConfirmation with backend_type=walsender.
> >
> > I have attached sample-code to do the same in the attached txt file,
> > please have a look. I discussed with Hou-San offline, he is okay with
> > this idea. Please see if it works and change it as needed.
> >
>
> More than the execution time, I'm concerned if the test-case
> effectively validates what we want.
>

I see your point. I agree that relying on wal_receiver_status_interval
may not be reliable for this test. Could we instead call
pg_wal_replay_pause() on the standby and then check for
wait_event=WaitForStandbyConfirmation with backend_type=walsender on
the primary? Or do you see any issues with this approach that I might
be overlooking?
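
Roughly what I have in mind, as an untested sketch rather than a
finished test. The message prefix 'qtest' is reused from your example
and the payload 'replay_pause_probe' is made up for illustration.
Whether pausing replay actually keeps the slot behind is exactly the
open question, so the last query may well return no rows:

```sql
-- On the standby: pause WAL replay and confirm the pause took effect.
SELECT pg_wal_replay_pause();
SELECT pg_get_wal_replay_pause_state();

-- On the primary: generate some WAL that the logical walsender must decode.
SELECT pg_logical_emit_message(true, 'qtest', 'replay_pause_probe');

-- On the primary: look for a walsender waiting on standby confirmation.
-- Filtering further by application_name is left out here because the
-- name depends on how the test sets up its connections.
SELECT pid, application_name, backend_type, wait_event_type, wait_event
FROM pg_stat_activity
WHERE backend_type = 'walsender'
  AND wait_event = 'WaitForStandbyConfirmation';

-- On the standby: resume replay once the check is done.
SELECT pg_wal_replay_resume();
```

In a TAP test the pg_stat_activity check would presumably be polled
until it returns a row, so the test does not depend on timing.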

> With below setup, here is what I observe:
>
> Setup:
>
> Primary : psql -p 5555 (synchronous_standby_names = 'ANY 1
> (standby1, standby2)'; synchronized_standby_slots = 'FIRST 1
> (sb1_slot, sb2_slot)')
> Standby1 : psql -p 5556 (wal_receiver_status_interval=0)
> Standby2 : psql -p 5557 (wal_receiver_status_interval=10s)
>
> --
>
> Observations:
>
> [local]:5555 ashu(at)postgres=# SELECT pg_logical_emit_message(true,
> 'qtest', 'first_1_lagging_blocks_1');
> pg_logical_emit_message
> -------------------------
> 0/04000220
> (1 row)
>
> Time: 14.378 ms
>
> [local]:5555 ashu(at)postgres=# select slot_name, active_pid, restart_lsn
> from pg_replication_slots where slot_type = 'physical';
> slot_name | active_pid | restart_lsn
> -----------+------------+-------------
> sb1_slot | 105328 | 0/04000250
> sb2_slot | 105381 | 0/04000250
> (2 rows)
>
> Time: 1.370 ms
>
> --
>
> [local]:5555 ashu(at)postgres=# SELECT pg_logical_emit_message(true,
> 'qtest', 'first_1_lagging_blocks_2');
> pg_logical_emit_message
> -------------------------
> 0/040002A0
> (1 row)
>
> Time: 13.533 ms
>
> [local]:5555 ashu(at)postgres=# select slot_name, active_pid, restart_lsn
> from pg_replication_slots where slot_type = 'physical';
> slot_name | active_pid | restart_lsn
> -----------+------------+-------------
> sb1_slot | 105328 | 0/040002D0
> sb2_slot | 105381 | 0/040002D0
> (2 rows)
>
> --
>
> Takeaways:
>
> 1) In both cases, even though wal_receiver_status_interval = 0 on
> standby1, the restart_lsn of standby1's slot quickly moved past the
> LSN of the emitted logical message. That makes sense:
> wal_receiver_status_interval = 0 disables periodic status packets, but
> the receiver and walsender still exchange feedback on other events, so
> the slot's restart_lsn can move quickly.
> 2) On a fast local setup, both sb1_slot and sb2_slot can advance past
> the emitted LSN before we query pg_replication_slots, which makes the
> test case timing-sensitive and therefore flaky/nondeterministic.
>
> --
> With Regards,
> Ashutosh Sharma.
