RE: Synchronizing slots from primary to standby

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: RE: Synchronizing slots from primary to standby
Date: 2024-03-29 01:06:15
Message-ID: OS0PR01MB571615D35F486080616CA841943A2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thursday, March 28, 2024 10:02 PM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Thursday, March 28, 2024 7:32 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> >
> > On Thu, Mar 28, 2024 at 10:08 AM Zhijie Hou (Fujitsu)
> > <houzj(dot)fnst(at)fujitsu(dot)com>
> > wrote:
> > >
> > > When analyzing one BF error[1], we find an issue of slotsync: Since
> > > we don't perform logical decoding for the synced slots when syncing
> > > the lsn/xmin of slot, no logical snapshots will be serialized to
> > > disk. So, when user starts to use these synced slots after
> > > promotion, it needs to re-build the consistent snapshot from the
> > > restart_lsn if the
> > > WAL(xl_running_xacts) at restart_lsn position indicates that there
> > > are running transactions. This however could cause the data that
> > > before the
> > consistent point to be missed[2].
> > >
> > > This issue doesn't exist on the primary because the snapshot at
> > > restart_lsn should have been serialized to disk
> > > (SnapBuildProcessRunningXacts -> SnapBuildSerialize), so even if the
> > > logical decoding restarts, it can find consistent snapshot
> > > immediately at
> > restart_lsn.
> > >
> > > To fix this, we could use the fast forward logical decoding to
> > > advance the synced slot's lsn/xmin when syncing these values instead
> > > of directly updating the slot's info. This way, the snapshot will be
> > > serialized to
> > disk when decoding.
> > > If we could not reach to the consistent point at the remote
> > > restart_lsn, the slot is marked as temp and will be persisted once
> > > it reaches the consistent point. I am still analyzing the fix and
> > > will share once
> > ready.
> > >
> >
> > Yes, we can use this but one thing to note is that
> > CreateDecodingContext() will expect that the slot's and current
> > database are the same. I think the reason for that is we need to check
> > system tables of the current database while decoding and sending data
> > to the output_plugin which won't be a requirement for the fast_forward
> > case. So, we need to skip that check in fast_forward mode.
>
> Agreed.
>
> >
> > Next, I was thinking about the case of the first time updating the
> > restart and confirmed_flush LSN while syncing the slots. I think we
> > can keep the current logic as it is based on the following analysis.
> >
> > For each logical slot, cases possible on the primary:
> > 1. The restart_lsn doesn't have a serialized snapshot and hasn't yet
> > reached the consistent point.
> > 2. The restart_lsn doesn't have a serialized snapshot but has reached
> > a consistent point.
> > 3. The restart_lsn has a serialized snapshot which means it has
> > reached a consistent point as well.
> >
> > Considering we keep the logic to reserve initial WAL positions the
> > same as the current (Reserve WAL for the currently active local slot
> > using the specified WAL location (restart_lsn). If the given WAL
> > location has been removed, reserve WAL using the oldest existing WAL
> > segment.), I could think of the below
> > scenarios:
> > A. For 1, we shouldn't sync the slot as it still wouldn't have been
> > marked persistent on the primary.
> > B. For 2, we would sync the slot
> > B1. If remote_restart_lsn >= local_resart_lsn, then advance the
> > slot by calling pg_logical_replication_slot_advance().
> > B11. If we reach consistent point, then it should be okay
> > because after promotion as well we should reach consistent point.
> > B111. But again is it possible that there is some xact
> > that comes before consistent_point on primary and the same is after
> > consistent_point on standby? This shouldn't matter as we will start
> > decoding transactions after confirmed_flush_lsn which would be the same on
> primary and standby.
> > B22. If we haven't reached consistent_point, then we won't mark
> > the slot as persistent, and at the next sync we will do the same till
> > it reaches consistent_point. At that time, the situation will be similar to B11.
> > B2. If remote_restart_lsn < local_restart_lsn, then we will wait
> > for the next sync cycle and keep the slot as temporary. Once in the
> > next or some consecutive sync cycle, we reach the condition
> > remote_restart_lsn >= local_restart_lsn, we will proceed to advance
> > the slot and we should have the same behavior as B1.
> > C. For 3, we would sync the slot, but the behavior should be the same as B.
> >
> > Thoughts?
>
> Looks reasonable to me.
>
> Here is the patch based on above lines.
> I am also testing and verifying the patch locally.

Attach a new version patch which fixed an un-initialized variable issue and
added some comments. Also, temporarily enable DEBUG2 for the 040 tap-test so that
we can analyze the possible CFbot failures easily.

Best Regards,
Hou zj

Attachment Content-Type Size
v2-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch application/octet-stream 10.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2024-03-29 01:12:48 Re: BitmapHeapScan streaming read user and prelim refactoring
Previous Message Erik Wienhold 2024-03-29 00:38:17 Re: PSQL Should \sv & \ev work with materialized views?