Re: Synchronizing slots from primary to standby

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-02-15 02:48:59
Message-ID: CAA4eK1+d5Lne8vCAn0un4SP9x-ZBr2-xfxg01uSfeBTSCKFZoQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> On Wed, Feb 14, 2024 at 10:40:11AM +0000, Zhijie Hou (Fujitsu) wrote:
> > On Wednesday, February 14, 2024 6:05 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > To ensure that restart_lsn has been moved to a recent position, we need to log
> > > XLOG_RUNNING_XACTS and make sure the same is processed as well by
> > > walsender. The attached patch does the required change.
> > >
> > > Hou-San can reproduce this problem by adding additional checkpoints in the
> > > test and after applying the attached it fixes the problem. Now, this patch is
> > > mostly based on the theory we formed based on LOGs on BF and a reproducer
> > > by Hou-San, so still, there is some chance that this doesn't fix the BF failures in
> > > which case I'll again look into those.
> >
> > I have verified that the patch can fix the issue on my machine(after adding few
> > more checkpoints before slot invalidation test.) I also added one more check in
> > the test to confirm the synced slot is not temp slot. Here is the v2 patch.
>
> Thanks!
>
> +# To ensure that restart_lsn has moved to a recent WAL position, we need
> +# to log XLOG_RUNNING_XACTS and make sure the same is processed as well
> +$primary->psql('postgres', "CHECKPOINT");
>
> Instead of "CHECKPOINT" wouldn't a less heavy "SELECT pg_log_standby_snapshot();"
> be enough?
>

Yeah, that would be enough. However, the test still fails randomly for
the same reason; see [1]. So, as mentioned yesterday, I now feel it is
better to recreate the subscription/slot so that it gets the latest
restart_lsn rather than relying on pg_log_standby_snapshot() to move
it.

> Not a big deal but maybe we could do the change while modifying
> 040_standby_failover_slots_sync.pl in the next patch "Add a new slotsync worker".
>

Right, we can do that, or probably this test would have made more sense
with the worker patch, where we could wait for the slot to be synced.
Anyway, let's try the idea of recreating the slot/subscription. BTW, do
you think that adding a LOG message when we are not able to sync would
help in debugging such problems? I think eventually we can change it to
DEBUG1, but for now it can help with stabilizing the BF and/or some
other reported issues.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-02-15%2000%3A14%3A38

--
With Regards,
Amit Kapila.
