RE: Synchronizing slots from primary to standby

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: RE: Synchronizing slots from primary to standby
Date: 2024-01-09 12:14:04
Message-ID: OS0PR01MB57169DD55EC8D9D1EDB7A0C2946A2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Monday, January 8, 2024 2:10 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Fri, Jan 5, 2024 at 5:45 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Fri, Jan 5, 2024 at 4:25 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > >
> > > On Fri, Jan 5, 2024 at 8:59 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> > > >
> > > I was going through the patch set again and have a question. The below
> > > comments say that we keep the failover option as PENDING until we
> > > have done the initial table sync which seems fine. But what happens
> > > if we add a new table to the publication and refresh the
> > > subscription? In such a case does this go back to the PENDING state or
> > > something else?
> > >
> >
> > At this stage, such an operation is prohibited. Users need to disable
> > the failover option first, then perform the above operation, and after
> > that failover option can be re-enabled.
>
> Okay, that makes sense to me.

During the off-list discussion, Sawada-san proposed an idea that can lift the
restriction on table sync: instead of relying on the latest WAL position, we
can use the remote restart_lsn to reserve WAL when creating a new synced slot
on the standby. This approach eliminates the need to wait for the primary
server to catch up, which speeds up synced slot creation on the standby in
most scenarios.

With this approach, the limitation that prevents users from performing table
sync while failover is enabled can be removed. In previous versions, this
restriction existed because table sync slots were often incompletely
synchronized to the standby (the slots on the primary could not catch up with
the synced slots). With this approach, the table sync slots can be
efficiently synced to the standby in most cases.

However, there could still be rare cases where the WAL around the remote
restart_lsn has already been removed on the standby. In this case, we will try
to reserve the last remaining WAL and mark the slot as temporary; these
temporary slots will be converted to persistent once the remote restart_lsn
catches up.
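
To make the intended behavior easier to follow, here is a minimal Python
sketch of the decision described above. It is only a simplified model, not the
code in the patch: the names SyncedSlot, create_synced_slot, sync_cycle and
oldest_local_wal_lsn are hypothetical, and LSNs are modeled as plain integers
parsed from the usual "hi/lo" hexadecimal text form.

```python
from dataclasses import dataclass


def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '0/3000060' to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


@dataclass
class SyncedSlot:
    # Hypothetical model of a synced slot on the standby.
    name: str
    restart_lsn: int   # position from which WAL is reserved on the standby
    persistent: bool   # False means the slot is still temporary


def create_synced_slot(name: str, remote_restart_lsn: int,
                       oldest_local_wal_lsn: int) -> SyncedSlot:
    """Reserve WAL for a new synced slot (assumed logic, see lead-in)."""
    if remote_restart_lsn >= oldest_local_wal_lsn:
        # Common case: the WAL at the remote restart_lsn still exists on the
        # standby, so reserve from there and persist the slot right away.
        return SyncedSlot(name, remote_restart_lsn, persistent=True)
    # Rare case: that WAL was already removed locally. Reserve the oldest
    # remaining WAL and keep the slot temporary for now.
    return SyncedSlot(name, oldest_local_wal_lsn, persistent=False)


def sync_cycle(slot: SyncedSlot, remote_restart_lsn: int) -> SyncedSlot:
    """One pass of the sync loop for an existing synced slot."""
    if remote_restart_lsn < slot.restart_lsn:
        # The remote slot has not caught up yet: leave the slot unchanged.
        return slot
    # The remote restart_lsn has caught up: advance and (if needed) persist.
    slot.restart_lsn = remote_restart_lsn
    slot.persistent = True
    return slot
```

For example, create_synced_slot("sub1", parse_lsn("0/1000000"),
parse_lsn("0/2000000")) yields a temporary slot reserved at 0/2000000, and a
later sync_cycle() call with a remote restart_lsn of 0/2000010 turns it into a
persistent slot.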

We think this idea is promising, and here is the V58 patch set, which tries to
implement it. The summary of changes for each patch is as follows:

V58-0001

1) Enables failover for table sync slot.
2) Removes the restriction on table sync when failover is enabled.
3) Removes tristate handling for failover state.
4) Renames failoverstate to failover.
5) Addresses Peter's comments[1].

V58-0002
1) Adds documentation on how to resume logical replication after failover.
2) No longer syncs temporary slots from the primary server.
3) Fixes one place where a spinlock was missed.
4) Fixes one CFbot warning.
5) Fixes a bug where last_update_time is not initialized.
6) Reserves WAL based on the remote restart_lsn.
7) Improves and adjusts the tests.
8) Removes the separate function wait_for_primary_slot_catchup() and
integrates its logic of marking the slot as ready into the main loop.
9) Removes the 'i' state of sync_state. The slots that need to wait for the
primary to catch up will be marked as TEMPORARY, and they will be converted
to PERSISTENT once the remote restart_lsn catches up.

Thanks Shveta for working on 1) to 4).

V58-0003
Rebases the tests.

V58-0004
Addresses Bertrand's comments[2]. Thanks Shveta for working on this.

TODO: Add documentation to guide users on how to identify whether the table
sync slots and the main slot are READY, so that logical replication can be
resumed by subscribing to the new primary.

[1] https://www.postgresql.org/message-id/CAHut%2BPvbbPz1%3DT4bzY0_GotUK460Eih41Twjt%3DczJ1z2J8SGEw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/ZZa4pLFCe2mAks1m%40ip-10-97-1-34.eu-west-3.compute.internal

Best Regards,
Hou zj

Attachment                                                       Content-Type              Size
v58-0004-Non-replication-connection-and-app_name-change.patch    application/octet-stream  9.5 KB
v58-0001-Enable-setting-failover-property-for-a-slot-thro.patch  application/octet-stream  100.7 KB
v58-0002-Add-logical-slot-sync-capability-to-the-physical.patch  application/octet-stream  82.6 KB
v58-0003-Allow-logical-walsenders-to-wait-for-the-physica.patch  application/octet-stream  39.7 KB
