From: | shveta malik <shveta(dot)malik(at)gmail(dot)com> |
---|---|
To: | Ajin Cherian <itsajin(at)gmail(dot)com> |
Cc: | PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com> |
Subject: | Re: Improve pg_sync_replication_slots() to wait for primary to advance |
Date: | 2025-07-02 09:55:56 |
Message-ID: | CAJpy0uBf=tY7HZAtBfAFWvFVtVbsNtehJ6s34w_KGDHUHoFKZA@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Jun 24, 2025 at 4:11 PM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
>
> Hello,
>
> Creating this thread for a POC based on discussions in thread [1].
> Hou-san had created this patch, and I just cleaned up some documents,
> did some testing and now sharing the patch here.
>
> In this patch, the pg_sync_replication_slots() API now waits
> indefinitely for the remote slot to catch up. We could later add a
> timeout parameter to control maximum wait time if this approach seems
> acceptable. If there are more ideas on improving this patch, let me
> know.
+1 on the idea.
I believe the timeout option may not be necessary here, since the API
can be manually canceled if needed. Otherwise, the recommended
approach is to let it complete. But I would like to know what others
think here.
A few comments:
1)
While the API is waiting for the primary to advance, the standby fails
to handle promotion requests. Promotion fails:
./pg_ctl -D ../../standbydb/ promote -w
waiting for server to promote.................stopped waiting
pg_ctl: server did not promote in time
See the logs at [1].
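To illustrate the behaviour I would expect, here is a minimal, self-contained sketch of a wait loop that re-checks for a promotion request on every iteration and bails out instead of blocking it. This is not code from the patch or from PostgreSQL: WaitState, WaitResult, wait_for_remote_slot() and the promote_after_poll field are hypothetical stand-ins, LSNs are plain integers, and max_polls exists only to keep the simulation bounded (the patch itself waits indefinitely).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for server state; not a PostgreSQL structure. */
typedef struct WaitState
{
    long remote_lsn;          /* position reported by the primary */
    long local_lsn;           /* position the local slot requires */
    int  promote_after_poll;  /* poll at which promotion is requested; -1 = never */
} WaitState;

typedef enum WaitResult
{
    WAIT_CAUGHT_UP,
    WAIT_ABORTED_BY_PROMOTION,
    WAIT_GAVE_UP
} WaitResult;

/*
 * Poll until the remote slot catches up, but check for a pending
 * promotion request first on every iteration so the wait can never
 * hold up promotion.
 */
static WaitResult
wait_for_remote_slot(WaitState *st, int max_polls, long advance_per_poll)
{
    assert(max_polls > 0);

    for (int poll = 0; poll < max_polls; poll++)
    {
        /* Promotion takes priority over continuing to wait. */
        if (st->promote_after_poll >= 0 && poll >= st->promote_after_poll)
            return WAIT_ABORTED_BY_PROMOTION;

        if (st->remote_lsn >= st->local_lsn)
            return WAIT_CAUGHT_UP;

        /* Simulate the primary advancing while we sleep. */
        st->remote_lsn += advance_per_poll;
    }

    return WAIT_GAVE_UP;
}
```

In the scenario above, the primary never advances (advance_per_poll = 0), so only the promotion check can end the wait.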
2)
Also, when the API waits for a long time, it emits the
'waiting for remote_slot..' LOG message only once. Do you think it makes
sense to log it at a regular interval until the wait is over? See the
logs at [1]. The message was logged only once in 3 minutes.
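A rough sketch of what interval-based logging could look like, as a standalone simulation rather than a change to the patch. WaitLogState, should_log_wait() and simulate_wait() are hypothetical names, time is a plain millisecond counter, and the interval value is an arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical bookkeeping for the repeated "waiting" message. */
typedef struct WaitLogState
{
    long last_log_ms;   /* time the message was last emitted */
    bool logged_once;   /* false until the first message is emitted */
} WaitLogState;

/* Return true when the message is due: on the first call, then once per interval. */
static bool
should_log_wait(WaitLogState *s, long now_ms, long interval_ms)
{
    assert(interval_ms > 0);

    if (!s->logged_once || now_ms - s->last_log_ms >= interval_ms)
    {
        s->logged_once = true;
        s->last_log_ms = now_ms;
        return true;
    }
    return false;
}

/* Simulate a wait of total_ms, polling every poll_ms; return the number of messages emitted. */
static int
simulate_wait(long total_ms, long poll_ms, long interval_ms)
{
    WaitLogState s = {0, false};
    int emitted = 0;

    assert(poll_ms > 0);

    for (long now = 0; now < total_ms; now += poll_ms)
    {
        if (should_log_wait(&s, now, interval_ms))
        {
            printf("LOG: still waiting for remote slot (%ld ms elapsed)\n", now);
            emitted++;
        }
    }
    return emitted;
}
```

With a 3-minute wait, a 2-second poll and a 30-second interval, simulate_wait(180000, 2000, 30000) emits six messages instead of one.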
3)
+ /*
+  * It is possible to get null value for restart_lsn if the slot is
+  * invalidated on the primary server, so handle accordingly.
+  */
+ if (new_invalidated || XLogRecPtrIsInvalid(new_restart_lsn))
+ {
+     /*
+      * The slot won't be persisted by the caller; it will be cleaned up
+      * at the end of synchronization.
+      */
+     ereport(WARNING,
+             errmsg("aborting initial sync for slot \"%s\"",
+                    remote_slot->name),
+             errdetail("This slot was invalidated on the primary server."));
Which case are we referring to here, where a null restart_lsn would mean
invalidation? Can you please point me to the code where this happens, or
to a test case that exercises it? I tried a few invalidation cases but
did not hit it.
[1]:
Log file:
2025-07-02 14:38:09.851 IST [153187] LOG: waiting for remote slot
"failover_slot" LSN (0/3003F60) and catalog xmin (754) to pass local
slot LSN (0/3003F60) and catalog xmin (767)
2025-07-02 14:38:09.851 IST [153187] STATEMENT: SELECT
pg_sync_replication_slots();
2025-07-02 14:41:36.200 IST [153164] LOG: received promote request
thanks
Shveta