RE: Synchronizing slots from primary to standby

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: RE: Synchronizing slots from primary to standby
Date: 2024-02-15 03:35:34
Message-ID: OS0PR01MB57160EB0B56BC6F328F7ED62944D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Thursday, February 15, 2024 10:49 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> >
> > On Wed, Feb 14, 2024 at 10:40:11AM +0000, Zhijie Hou (Fujitsu) wrote:
> > > On Wednesday, February 14, 2024 6:05 PM Amit Kapila
> > > <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > To ensure that restart_lsn has been moved to a recent position, we
> > > > need to log XLOG_RUNNING_XACTS and make sure the same is processed
> > > > as well by walsender. The attached patch does the required change.
> > > >
> > > > Hou-San can reproduce this problem by adding additional
> > > > checkpoints in the test and after applying the attached it fixes
> > > > the problem. Now, this patch is mostly based on the theory we
> > > > formed based on LOGs on BF and a reproducer by Hou-San, so still,
> > > > there is some chance that this doesn't fix the BF failures in which
> > > > case I'll again look into those.
> > >
> > > I have verified that the patch fixes the issue on my machine (after
> > > adding a few more checkpoints before the slot invalidation test). I also
> > > added one more check in the test to confirm the synced slot is not a
> > > temp slot. Here is the v2 patch.
> >
> > Thanks!
> >
> > +# To ensure that restart_lsn has moved to a recent WAL position, we need
> > +# to log XLOG_RUNNING_XACTS and make sure the same is processed as well
> > +$primary->psql('postgres', "CHECKPOINT");
> >
> > Instead of "CHECKPOINT" wouldn't a less heavy
> > "SELECT pg_log_standby_snapshot();" be enough?
> >
>
> Yeah, that would be enough. However, the test still fails randomly due to the
> same reason. See [1]. So, as mentioned yesterday, now, I feel it is better to
> recreate the subscription/slot so that it can get the latest restart_lsn rather than
> relying on pg_log_standby_snapshot() to move it.
>
> > Not a big deal but maybe we could do the change while modifying
> > 040_standby_failover_slots_sync.pl in the next patch "Add a new slotsync
> > worker".
> >
>
> Right, we can do that or probably this test would have made more sense with a
> worker patch where we could wait for the slot to be synced.
> Anyway, let's try to recreate the slot/subscription idea. BTW, do you think that
> adding a LOG when we are not able to sync will help in debugging such
> problems? I think eventually we can change it to DEBUG1 but for now, it can help
> with stabilizing BF and/or some other reported issues.

Here is a patch that tries the idea of re-creating the subscription. I also
think a LOG/DEBUG message would be useful for this kind of analysis, so 0002
adds such a log.
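
For anyone following along without the attachment open, the re-create idea
can be sketched roughly as below. This is only an illustration of the
approach, not the contents of 0001: the subscription, publication and slot
names and the connection string are placeholders I picked for the example,
and the statements are meant to be run on the node noted in each comment.

```sql
-- On the subscriber: drop and re-create the subscription so that the
-- failover slot on the primary is created afresh and starts with a
-- recent restart_lsn. All names here are hypothetical placeholders.
DROP SUBSCRIPTION regress_mysub1;

CREATE SUBSCRIPTION regress_mysub1
    CONNECTION 'host=primary.example port=5432 dbname=postgres'  -- placeholder
    PUBLICATION regress_mypub
    WITH (slot_name = lsub1_slot, copy_data = false, failover = true);

-- On the primary: check where the newly created slot starts.
SELECT slot_name, restart_lsn, failover
  FROM pg_replication_slots
 WHERE slot_name = 'lsub1_slot';

-- On the standby: sync the slot, then confirm it was synced and has been
-- persisted rather than left as a temporary slot.
SELECT pg_sync_replication_slots();

SELECT slot_name, synced, temporary
  FROM pg_replication_slots
 WHERE slot_name = 'lsub1_slot';
```

The point of re-creating the slot rather than calling
pg_log_standby_snapshot() is that the test then no longer depends on the
walsender having processed a later XLOG_RUNNING_XACTS record to advance
restart_lsn.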

Best Regards,
Hou zj

Attachment                                                          Content-Type              Size
0002-Add-a-log-if-remote-slot-didn-t-catch-up-to-locally-.patch     application/octet-stream  1.2 KB
0001-fix-BF-error-take-2.patch                                      application/octet-stream  1.7 KB
