Re: Synchronizing slots from primary to standby

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-02-07 01:13:20
Message-ID: CAD21AoBjHYK_Zhr_3jjjJzP30LdbnMg0D4VtfZZkzucEZuo3ng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 6, 2024 at 8:21 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Feb 6, 2024 at 3:33 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > I think users can refer to LOGs to see if it has changed since the
> > > > first time it was configured. I tried by existing parameter and see
> > > > the following in LOG:
> > > > LOG: received SIGHUP, reloading configuration files
> > > > 2024-02-06 11:38:59.069 IST [9240] LOG: parameter "autovacuum" changed to "on"
> > > >
> > > > If the user can't confirm then it is better to follow the steps
> > > > mentioned in the patch. Do you want something else to be written in
> > > > docs for this? If so, what?
> > >
> > > IIUC even if a wrong slot name is specified to standby_slot_names or
> > > even standby_slot_names is empty, the standby server might not be
> > > lagging behind the subscribers depending on the timing. But when
> > > checking it the next time, the standby server might lag behind the
> > > subscribers. So what I wanted to know is how the user can confirm if a
> > > failover-enabled subscription is ensured not to go in front of
> > > failover-candidate standbys (i.e., standbys using the slots listed in
> > > standby_slot_names).
> > >
> >
> > But isn't the same explained by two steps ((a) Firstly, on the
> > subscriber node check the last replayed WAL. (b) Next, on the standby
> > server check that the last-received WAL location is ahead of the
> > replayed WAL location on the subscriber identified above.) in the
> > latest *_0004 patch.
> >
>
> Additionally, I would like to add that the users can use the queries
> mentioned in the doc after the primary has failed and before promoting
> the standby. If she wants to do that when both primary and standby are
> available, the value of 'standby_slot_names' on primary should be
> referred. Isn't those two sufficient that there won't be false
> positives?

From a user perspective, I'd like to confirm the following two points :

1. replication slots used by subscribers are synchronized to the standby.
2. it's guaranteed that logical replication doesn't go ahead of
physical replication to the standby.

These checks are necessary at least when building a replication setup
(primary, standby, and subscriber). Otherwise, it's too late if we
find out that no standby is failover-ready when the primary fails and
we're about to do a failover.

As for the point 1 above, we can use the step 1 described in the doc.

As for point 2, the step 2 described in the doc could return true even
if standby_slot_names isn't working. For example, standby_slot_names
is empty, the user changed the standby_slot_names but forgot to reload
the config file, and the walsender doesn't reflect the
standby_slot_names update yet for some reason etc. It's possible that
standby's last-received WAL location just happens to be ahead of the
replayed WAL location on the subscriber. So even if the check query
returns true once, it could return false when we check it again, if
standby_slot_names is not working. On the other hand, IIUC if the
point 2 is ensured, the check query always returns true. I think it
would be good if we could provide a reliable way to check point 2
ideally via SQL queries (especially for tools).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Yugo NAGATA 2024-02-07 01:19:03 Re: pgbnech: allow to cancel queries during benchmark
Previous Message Andres Freund 2024-02-06 23:33:36 Re: Make COPY format extendable: Extract COPY TO format implementations