From: | Ajin Cherian <itsajin(at)gmail(dot)com> |
---|---|
To: | shveta malik <shveta(dot)malik(at)gmail(dot)com> |
Cc: | PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Improve pg_sync_replication_slots() to wait for primary to advance |
Date: | 2025-08-06 02:03:28 |
Message-ID: | CAFPTHDY6kCU2CJb87Qwy8JVjPBbr8C9eFYT7sKJ2ZafYMC=xng@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Aug 1, 2025 at 4:32 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
>
> Thanks for the patch. I tested it, please find a few comments:
>
>
> 1)
> it hits an assert
> (slotsync_reread_config()-->Assert(sync_replication_slots)) when API
> is trying to sync and is in wait loop while in another session, I
> enable sync_replication_slots using:
>
> ALTER SYSTEM SET sync_replication_slots = 'on';
> SELECT pg_reload_conf();
>
> Assert:
> 025-08-01 10:55:43.637 IST [118576] STATEMENT: SELECT
> pg_sync_replication_slots();
> 2025-08-01 10:55:51.730 IST [118563] LOG: received SIGHUP, reloading
> configuration files
> 2025-08-01 10:55:51.731 IST [118563] LOG: parameter
> "sync_replication_slots" changed to "on"
> TRAP: failed Assert("sync_replication_slots"), File: "slotsync.c",
> Line: 1334, PID: 118576
> postgres: shveta postgres [local]
> SELECT(ExceptionalCondition+0xbb)[0x61df0160e090]
> postgres: shveta postgres [local] SELECT(+0x6520dc)[0x61df0133a0dc]
> 2025-08-01 10:55:51.739 IST [118666] ERROR: cannot synchronize
> replication slots concurrently
> postgres: shveta postgres [local] SELECT(+0x6522b2)[0x61df0133a2b2]
> postgres: shveta postgres [local] SELECT(+0x650664)[0x61df01338664]
> postgres: shveta postgres [local] SELECT(+0x650cf8)[0x61df01338cf8]
> postgres: shveta postgres [local] SELECT(+0x6513ea)[0x61df013393ea]
> postgres: shveta postgres [local] SELECT(+0x6519df)[0x61df013399df]
> postgres: shveta postgres [local]
> SELECT(SyncReplicationSlots+0xbb)[0x61df0133af60]
> postgres: shveta postgres [local]
> SELECT(pg_sync_replication_slots+0x1b1)[0x61df01357e52]
>
Fixed.
> 2)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot synchronize replication slots when"
> + " standby promotion is ongoing")));
>
> I think better error message will be:
> "exiting from slot synchronization as promotion is triggered"
>
> This will be better suited in log file as well after below wait statements:
> LOG: continuing to wait for remote slot "failover_slot" LSN
> (0/3000060) and catalog xmin (755) to pass local slot LSN (0/3000060)
> and catalog xmin (757)
> STATEMENT: SELECT pg_sync_replication_slots();
>
Fixed.
> 3)
> API dumps this when it is waiting for primary:
>
> ----
> LOG: could not synchronize replication slot "failover_slot2"
> DETAIL: Synchronization could lead to data loss, because the remote
> slot needs WAL at LSN 0/03066E70 and catalog xmin 755, but the standby
> has LSN 0/03066E70 and catalog xmin 770.
> STATEMENT: SELECT pg_sync_replication_slots();
> LOG: waiting for remote slot "failover_slot2" LSN (0/3066E70) and
> catalog xmin (755) to pass local slot LSN (0/3066E70) and catalog xmin
> (770)
> STATEMENT: SELECT pg_sync_replication_slots();
> LOG: continuing to wait for remote slot "failover_slot2" LSN
> (0/3066E70) and catalog xmin (755) to pass local slot LSN (0/3066E70)
> and catalog xmin (770)
> STATEMENT: SELECT pg_sync_replication_slots();
> ----
>
> Unsure if we shall still dump 'could not synchronize..' when it is
> going to retry until it succeeds? The concerned log gives a feeling
> that we are done trying and could not synchronize it. What do you
> think?
>
I've modified the log to now say, "initial sync of replication slot
\"%s\" failed; will keep retrying"
On Fri, Aug 1, 2025 at 7:20 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
>
> A few more comments:
>
> 4)
>
> + if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot))
> + {
> + ereport(WARNING,
> + errmsg("aborting initial sync for slot \"%s\"",
> + remote_slot->name),
> + errdetail("This slot was not found on the primary server."));
> +
> + pfree(cmd.data);
> + walrcv_clear_result(res);
> +
> + return false;
> + }
>
> We may have 'aborting sync for slot', can remove 'initial'.
Fixed.
>
> 5)
> I tried a test where there were 4 slots on the publisher, where one
> was getting used while the others were not. Initiated
> pg_sync_replication_slots on standby. Forced unused slots to be
> invalidated by setting idle_replication_slot_timeout=60 on primary,
> due to which API finished but gave a warning:
>
> postgres=# SELECT pg_sync_replication_slots();
> WARNING: aborting initial sync for slot "failover_slot"
> DETAIL: This slot was invalidated on the primary server.
> WARNING: aborting initial sync for slot "failover_slot2"
> DETAIL: This slot was invalidated on the primary server.
> WARNING: aborting initial sync for slot "failover_slot3"
> DETAIL: This slot was invalidated on the primary server.
> pg_sync_replication_slots
> ---------------------------
>
> (1 row)
>
> Do we need these warnings here? I think we can have it as a LOG rather
> than having it on console. Thoughts?
>
> If we inclined towards WARNING here, will ti be better to have it as
> a single line:
>
> WARNING: aborting sync for slot "failover_slot" as the slot was
> invalidated on primary
> WARNING: aborting sync for slot "failover_slot1" as the slot was
> invalidated on primary
> WARNING: aborting sync for slot "failover_slot2" as the slot was
> invalidated on primary
>
I've changed it to LOG now.
>
> 6)
> - * We do not drop the slot because the restart_lsn can be ahead of the
> - * current location when recreating the slot in the next cycle. It may
> - * take more time to create such a slot. Therefore, we keep this slot
> - * and attempt the synchronization in the next cycle.
> + * If we're in the slotsync worker, we retain the slot and retry in the
> + * next cycle. The restart_lsn might advance by then, allowing the slot
> + * to be created successfully later.
> */
>
> I like the previous command better as it was conveying the side-effect
> of dropping the slot herer. Can we try to incorporate the previous
> comment here and make it specific to slotsync workers?
>
Reverted to the previous version.
Attaching patch v4 which addresses these comments.
regards,
Ajin Cherian
Fujitsu Australia
Attachment | Content-Type | Size |
---|---|---|
v4-0001-Improve-initial-slot-synchronization-in-pg_sync_r.patch | application/octet-stream | 17.2 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Ajin Cherian | 2025-08-06 02:05:11 | Re: Improve pg_sync_replication_slots() to wait for primary to advance |
Previous Message | Michael Paquier | 2025-08-06 01:30:54 | Re: Dead code with short varlenas in toast_save_datum() |