Re: Synchronizing slots from primary to standby

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-12-22 14:29:28
Message-ID: ZYWdSIeAMQQcLmVT@ip-10-97-1-34.eu-west-3.compute.internal
Lists: pgsql-hackers

Hi,

On Fri, Dec 22, 2023 at 04:02:21PM +0530, shveta malik wrote:
> PFA v53. Changes are:

Thanks!

> patch002:
> 2) Addressed comments in [2] for v52-002.
> 3) Fixed CFBot failure. The failure was caused by an assert in
> wait_for_primary_slot_catchup() for null confirmed_lsn received. In
> wait_for_primary_slot_catchup(), we had an assumption that if
> restart_lsn is valid and 'conflicting' is also false, then we must
> have non-null confirmed_lsn. But this is not true. It is possible to
> get null values for confirmed_lsn and catalog_xmin if on the primary
> server the slot is just created with a valid restart_lsn and slot-sync
> worker has fetched the slot before the primary server could set valid
> confirmed_lsn and catalog_xmin. In
> pg_create_logical_replication_slot(), there is a small window between
> CreateInitDecodingContext-->ReplicationSlotReserveWal() which sets
> restart_lsn and DecodingContextFindStartpoint() which sets
> confirmed_lsn. If the slot-sync worker fetches the slot in this
> window, confirmed_lsn received will be NULL. Corrected the code to
> remove assert and added one additional condition that confirmed_lsn
> should be valid before moving the slot to 'r'.
>
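
To make sure I read the fix correctly, here is a toy model of the corrected condition. It is only a sketch, not the patch's code: RemoteSlot and can_mark_ready() are hypothetical names, None stands for a NULL value fetched from the primary, and the real wait_for_primary_slot_catchup() checks more than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemoteSlot:
    """Hypothetical view of one slot as fetched from the primary."""
    restart_lsn: Optional[int]    # None models a NULL restart_lsn
    confirmed_lsn: Optional[int]  # None models a NULL confirmed_lsn
    catalog_xmin: Optional[int]   # None models a NULL catalog_xmin
    conflicting: bool


def can_mark_ready(slot: RemoteSlot) -> bool:
    """Return True only if the slot may be moved to the 'r' state.

    A valid restart_lsn and a non-conflicting slot are not enough:
    confirmed_lsn must be valid too, because the slot may have been
    fetched before the primary finished creating it.
    """
    return (
        slot.restart_lsn is not None
        and not slot.conflicting
        and slot.confirmed_lsn is not None
    )


# Slot fetched inside the creation window: restart_lsn is reserved,
# confirmed_lsn and catalog_xmin are not set yet.
half_created = RemoteSlot(restart_lsn=100, confirmed_lsn=None,
                          catalog_xmin=None, conflicting=False)
fully_created = RemoteSlot(restart_lsn=100, confirmed_lsn=120,
                           catalog_xmin=700, conflicting=False)

print(can_mark_ready(half_created))
print(can_mark_ready(fully_created))
```

The half-created slot is the case that used to trip the assert; with the extra condition it simply is not moved to 'r' yet.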

Looking at the v53-0002 commit message, it states:

"
If a logical slot on the primary is valid but is invalidated on the standby,
then that slot is dropped and recreated on the standby in next sync-cycle.
"

and one of the reasons mentioned is:

"
- The primary changes wal_level to a level lower than logical.
"

I think that as long as there is still a logical replication slot on the primary,
that should not be possible. The primary should fail to start with a message like:

"
2023-12-22 14:06:09.281 UTC [31824] FATAL: logical replication slot "logical_slot" exists, but wal_level < logical
"

Now, if:

- The standby is shut down
- All the logical replication slots are removed on the primary
- wal_level is set below logical on the primary, and the primary is restarted

Then when the standby starts, the "synced" slots will be invalidated and later
removed but not re-created on the next sync-cycle (because they don't exist
anymore on the primary).
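
To illustrate the two cases, here is a small toy model. It is a sketch under my own assumptions, not the patch's logic: primary_can_start() and sync_cycle() are hypothetical helpers, slots are modelled as plain names, and the drop and the re-creation are collapsed into a single call.

```python
WAL_LEVELS = ["minimal", "replica", "logical"]  # ordered from lowest to highest


def primary_can_start(wal_level, logical_slots):
    """Model the startup check: logical slots require wal_level = logical."""
    below_logical = WAL_LEVELS.index(wal_level) < WAL_LEVELS.index("logical")
    return not (below_logical and bool(logical_slots))


def sync_cycle(primary_slots, standby_slots):
    """Return the standby's synced slots after one (simplified) sync cycle.

    primary_slots: set of slot names existing on the primary.
    standby_slots: dict of slot name -> "valid" or "invalidated".
    """
    result = {}
    for name, state in standby_slots.items():
        if name not in primary_slots:
            # Slot is gone on the primary: dropped locally, never re-created.
            continue
        if state == "invalidated":
            # Still valid on the primary: dropped and re-created locally.
            result[name] = "valid"
        else:
            result[name] = state
    for name in primary_slots:
        # Slots seen for the first time are synced as valid.
        result.setdefault(name, "valid")
    return result


# Case 1: the slot still exists on the primary, so lowering wal_level
# prevents the primary from starting at all.
print(primary_can_start("replica", {"logical_slot"}))

# Case 2: the slots were removed first, the primary restarts fine, and
# the invalidated synced slot on the standby is removed for good.
print(primary_can_start("replica", set()))
print(sync_cycle(set(), {"logical_slot": "invalidated"}))
```

In this model the "dropped and recreated" outcome only happens when the slot is still present on the primary, which is the case the wal_level reason cannot reach.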

Would it be worth rewording that part a bit?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
