Re: Synchronizing slots from primary to standby

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-02-14 08:44:11
Message-ID: CAA4eK1JLBi3HzenB6do3_hd78kN0UDD1mz-vumWE52XHHEq5Bw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> Here is V87 patch that adds test for the suggested cases.
>

I have pushed this patch and it leads to a BF failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-14%2004%3A43%3A37

The test failures are:
# Failed test 'logical decoding is not allowed on synced slot'
# at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 272.
# Failed test 'synced slot on standby cannot be altered'
# at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 281.
# Failed test 'synced slot on standby cannot be dropped'
# at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 287.

The reason is that the logs show a different ERROR message than the
one these tests expect:
2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR: replication slot "lsub1_slot" is active for PID 1760871

The slot is still active because an earlier test (# Test that if the
synchronized slot is invalidated while the remote slot is still
valid, ....) failed to persist the slot, so the synced temporary slot
remained active.

The reason is clear from the standby logs below:

LOG: connection authorized: user=bf database=postgres
application_name=040_standby_failover_slots_sync.pl
LOG: statement: SELECT pg_sync_replication_slots();
LOG: dropped replication slot "lsub1_slot" of dbid 5
STATEMENT: SELECT pg_sync_replication_slots();
...
SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots
WHERE slot_name = 'lsub1_slot';

In the above logs, we should ideally see 'newly created slot
"lsub1_slot" is sync-ready now' after 'LOG: dropped replication slot
"lsub1_slot" of dbid 5'. Its absence means the test didn't accomplish
what it was supposed to. Ideally, that test itself should have failed,
but its pass criteria do not check whether the slot is persisted.
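
As a rough sketch of a stricter pass criterion (an assumption for
illustration, not what the test currently runs), the existing check
could additionally require that the slot is no longer temporary, using
the "temporary" column of pg_replication_slots:

```sql
-- Sketch only: true only when the synced slot is both valid and persisted.
-- A slot that is synced but still temporary makes this return false.
SELECT conflict_reason IS NULL AND synced AND NOT temporary
FROM pg_replication_slots
WHERE slot_name = 'lsub1_slot';
```

With a condition like this, the earlier test would presumably have
failed at the point where the slot was not persisted, instead of the
problem surfacing in the later tests.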

The probable reason for the failure is that the remote slot's
restart_lsn lags behind the oldest WAL segment on the standby. In the
test, we do ensure that the publisher and subscriber are caught up
with the following steps:
# Enable the subscription to let it catch up to the latest wal position
$subscriber1->safe_psql('postgres',
"ALTER SUBSCRIPTION regress_mysub1 ENABLE");

$primary->wait_for_catchup('regress_mysub1');

However, this doesn't guarantee that restart_lsn has advanced to a
position recent enough that the standby still has the corresponding
WAL. One easy fix is to re-create the subscription with the same
slot_name after we have ensured that the slot has been invalidated on
the standby, so that a new restart_lsn is assigned to the slot. But it
is better to analyze further why the slot's restart_lsn only sometimes
fails to advance far enough.
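
One way to check this theory by hand is to compare the WAL segment
holding the slot's restart_lsn on the primary with the oldest segment
still present on the standby. The queries below are only a diagnostic
sketch and are not part of the patch or the test; they assume both
nodes are on the same timeline and that the role running them is
allowed to call pg_ls_waldir().

```sql
-- On the primary: WAL segment that contains the slot's restart_lsn.
-- (pg_walfile_name() cannot be executed during recovery, so run it here.)
SELECT slot_name, restart_lsn,
       pg_walfile_name(restart_lsn) AS restart_segment
FROM pg_replication_slots
WHERE slot_name = 'lsub1_slot';

-- On the standby: oldest WAL segment still present in pg_wal.
-- The regex keeps only regular 24-hex-digit segment file names.
SELECT min(name) AS oldest_segment
FROM pg_ls_waldir()
WHERE name ~ '^[0-9A-F]{24}$';
```

If restart_segment sorts before oldest_segment, the standby no longer
has the WAL the slot needs, which would be consistent with the slot
not becoming sync-ready.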

--
With Regards,
Amit Kapila.
