Re: Synchronizing slots from primary to standby

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-02-14 13:56:15
Message-ID: ZczGf7tZaD0p8tNk@ip-10-97-1-34.eu-west-3.compute.internal
Lists: pgsql-hackers

Hi,

On Wed, Feb 14, 2024 at 10:40:11AM +0000, Zhijie Hou (Fujitsu) wrote:
> On Wednesday, February 14, 2024 6:05 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Wed, Feb 14, 2024 at 2:14 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu)
> > > <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > > >
> > > > Here is V87 patch that adds test for the suggested cases.
> > > >
> > >
> > > I have pushed this patch and it leads to a BF failure:
> > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-14%2004%3A43%3A37
> > >
> > > The test failures are:
> > > > # Failed test 'logical decoding is not allowed on synced slot'
> > > > # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 272.
> > > > # Failed test 'synced slot on standby cannot be altered'
> > > > # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 281.
> > > > # Failed test 'synced slot on standby cannot be dropped'
> > > > # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 287.
> > >
> > > The reason is that in LOGs, we see a different ERROR message than what
> > > is expected:
> > > 2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR:
> > > replication slot "lsub1_slot" is active for PID 1760871
> > >
> > > Now, we see the slot still active because a test before these tests (#
> > > Test that if the synchronized slot is invalidated while the remote
> > > slot is still valid, ....) is not able to successfully persist the
> > > slot and the synced temporary slot remains active.
> > >
> > > The reason is clear by referring to below standby LOGS:
> > >
> > > LOG: connection authorized: user=bf database=postgres
> > > application_name=040_standby_failover_slots_sync.pl
> > > LOG: statement: SELECT pg_sync_replication_slots();
> > > LOG: dropped replication slot "lsub1_slot" of dbid 5
> > > STATEMENT: SELECT pg_sync_replication_slots(); ...
> > > SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots
> > > WHERE slot_name = 'lsub1_slot';
> > >
> > > In the above LOGs, we should ideally see: "newly created slot
> > > "lsub1_slot" is sync-ready now" after the "LOG: dropped replication
> > > slot "lsub1_slot" of dbid 5" but lack of that means the test didn't
> > > accomplish what it was supposed to. Ideally, the same test should have
> > > failed but the pass criteria for the test failed to check whether the
> > > slot is persisted or not.
> > >
> > > The probable reason for failure is that remote_slot's restart_lsn lags
> > > behind the oldest WAL segment on standby. Now, in the test, we do
> > > ensure that the publisher and subscriber are caught up by following
> > > steps:
> > > > # Enable the subscription to let it catch up to the latest wal position
> > > > $subscriber1->safe_psql('postgres',
> > > > "ALTER SUBSCRIPTION regress_mysub1 ENABLE");
> > >
> > > $primary->wait_for_catchup('regress_mysub1');
> > >
> > > However, this doesn't guarantee that restart_lsn is moved to a
> > > position new enough that standby has a WAL corresponding to it.
> > >
> >
> > To ensure that restart_lsn has been moved to a recent position, we need to log
> > XLOG_RUNNING_XACTS and make sure the same is processed as well by
> > walsender. The attached patch does the required change.
> >
> > Hou-San can reproduce this problem by adding extra checkpoints to the test,
> > and applying the attached patch fixes it. This patch is mostly based on the
> > theory we formed from the BF logs and on Hou-San's reproducer, so there is
> > still some chance that it doesn't fix the BF failures, in which case I'll
> > look into them again.
>
> I have verified that the patch fixes the issue on my machine (after adding a
> few more checkpoints before the slot invalidation test). I also added one more
> check in the test to confirm that the synced slot is not a temporary slot.
> Here is the v2 patch.

Thanks!

+# To ensure that restart_lsn has moved to a recent WAL position, we need
+# to log XLOG_RUNNING_XACTS and make sure the same is processed as well
+$primary->psql('postgres', "CHECKPOINT");

Instead of "CHECKPOINT", wouldn't the lighter "SELECT pg_log_standby_snapshot();"
be enough?
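
Something along these lines, as an untested sketch: it assumes $primary is the
same PostgreSQL::Test::Cluster node the test already uses, and that the rest
of the hunk (including whatever makes sure the walsender processes the record)
stays as it is in the patch:

```perl
# Untested sketch, not part of the posted patch.
# pg_log_standby_snapshot() writes an XLOG_RUNNING_XACTS record to WAL
# without performing a full checkpoint.
$primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot();");
```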

Not a big deal, but maybe we could make that change while modifying
040_standby_failover_slots_sync.pl in the next patch, "Add a new slotsync worker".

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
