Re: Synchronizing slots from primary to standby

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-02-14 10:05:03
Message-ID: CAA4eK1+Jn=W6_XVxq2gG+fWX9a8iHa0DU0pcPcb41UAejUZ0rQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 14, 2024 at 2:14 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu)
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > Here is V87 patch that adds test for the suggested cases.
> >
>
> I have pushed this patch and it leads to a BF failure:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-14%2004%3A43%3A37
>
> The test failures are:
> # Failed test 'logical decoding is not allowed on synced slot'
> # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl
> line 272.
> # Failed test 'synced slot on standby cannot be altered'
> # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl
> line 281.
> # Failed test 'synced slot on standby cannot be dropped'
> # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl
> line 287.
>
> The reason is that in LOGs, we see a different ERROR message than what
> is expected:
> 2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR:
> replication slot "lsub1_slot" is active for PID 1760871
>
> Now, we see the slot still active because a test before these tests (#
> Test that if the synchronized slot is invalidated while the remote
> slot is still valid, ....) is not able to successfully persist the
> slot and the synced temporary slot remains active.
>
> The reason is clear by referring to below standby LOGS:
>
> LOG: connection authorized: user=bf database=postgres
> application_name=040_standby_failover_slots_sync.pl
> LOG: statement: SELECT pg_sync_replication_slots();
> LOG: dropped replication slot "lsub1_slot" of dbid 5
> STATEMENT: SELECT pg_sync_replication_slots();
> ...
> SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots
> WHERE slot_name = 'lsub1_slot';
>
> In the above LOGs, we should ideally see: "newly created slot
> "lsub1_slot" is sync-ready now" after the "LOG: dropped replication
> slot "lsub1_slot" of dbid 5" but lack of that means the test didn't
> accomplish what it was supposed to. Ideally, the same test should have
> failed but the pass criteria for the test failed to check whether the
> slot is persisted or not.
>
> The probable reason for failure is that remote_slot's restart_lsn lags
> behind the oldest WAL segment on standby. Now, in the test, we do
> ensure that the publisher and subscriber are caught up by following
> steps:
> # Enable the subscription to let it catch up to the latest wal position
> $subscriber1->safe_psql('postgres',
> "ALTER SUBSCRIPTION regress_mysub1 ENABLE");
>
> $primary->wait_for_catchup('regress_mysub1');
>
> However, this doesn't guarantee that restart_lsn is moved to a
> position new enough that standby has a WAL corresponding to it.
>

To ensure that restart_lsn has moved to a recent position, we need to
log an XLOG_RUNNING_XACTS record and make sure that the walsender has
processed it as well. The attached patch makes the required change.
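
For readers without the attachment, here is a rough sketch of what a
change along these lines could look like in the test. It is not the
attached patch itself, only an illustration of the idea. It assumes
the $primary and $subscriber1 node objects from the snippet quoted
above, and it uses pg_log_standby_snapshot() as one way to write an
XLOG_RUNNING_XACTS record; a CHECKPOINT on the primary would be
another way.

```perl
# Fragment meant to sit inside the existing TAP test; $primary and
# $subscriber1 are the test's PostgreSQL::Test::Cluster nodes.

# Enable the subscription to let it catch up to the latest WAL position.
$subscriber1->safe_psql('postgres',
	"ALTER SUBSCRIPTION regress_mysub1 ENABLE");

# This wait only tells us that the subscriber has caught up; it says
# nothing about where the slot's restart_lsn currently points.
$primary->wait_for_catchup('regress_mysub1');

# Assumption: explicitly log a running-xacts record so that logical
# decoding gets a recent point to which restart_lsn can advance.
$primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");

# Wait again so that the walsender has decoded the record logged above
# and the subscriber has confirmed a position beyond it.
$primary->wait_for_catchup('regress_mysub1');
```

The second wait is the important part: logging the record alone is
not enough, because restart_lsn can only move once the walsender has
processed that record.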

Hou-San can reproduce this problem by adding additional checkpoints in
the test, and applying the attached patch fixes it. Note that this
patch is mostly based on the theory we formed from the LOGs on BF and
Hou-San's reproducer, so there is still some chance that it doesn't
fix the BF failures, in which case I'll look into them again.
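
Separately, the quoted analysis notes that the earlier test passed
without checking whether the slot was actually persisted. A possible
tightening of that check is sketched below. This is only an
illustration and is not part of the attached patch; $standby1 is an
assumed name for the standby node, and the extra condition relies on
the temporary column of pg_replication_slots.

```perl
# Fragment meant to sit inside the existing TAP test; $standby1 is an
# assumed name for the standby's PostgreSQL::Test::Cluster node.
use Test::More;

# Besides being synced and not invalidated, require that the slot is
# no longer temporary, i.e. that the sync really persisted it.
is( $standby1->safe_psql(
		'postgres',
		q{SELECT conflict_reason IS NULL AND synced AND NOT temporary
		  FROM pg_replication_slots WHERE slot_name = 'lsub1_slot';}),
	't',
	'logical slot is synced and persisted on standby');
```

With a check like this, the earlier test itself would fail when the
slot is left as a temporary one, instead of the failure surfacing in
the later tests.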

--
With Regards,
Amit Kapila.

Attachment Content-Type Size
fix_040_standby_failover_slots_sync.1.patch application/octet-stream 860 bytes
