Re: speed up a logical replica setup

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Euler Taveira <euler(at)eulerto(dot)com>, "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: Re: speed up a logical replica setup
Date: 2024-03-26 19:12:59
Message-ID: 47524383-6fd1-4b13-8c4e-fb918b1d35ba@enterprisedb.com
Lists: pgsql-hackers

On 3/26/24 03:53, Euler Taveira wrote:
> On Mon, Mar 25, 2024, at 1:06 PM, Hayato Kuroda (Fujitsu) wrote:
>> ## Analysis for failure 1
>>
>> The failure is caused by a time lag between the walreceiver shutting down and
>> pg_is_in_recovery() starting to return false.
>>
>> According to the output [1], it seems that the tool failed at wait_for_end_recovery()
>> with the message "standby server disconnected from the primary". Also, lines
>> "redo done at..." and "terminating walreceiver process due to administrator command"
>> meant that walreceiver was requested to shut down by XLogShutdownWalRcv().
>>
>> Looking at the source, we can confirm that the walreceiver is shut down in
>> StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). However, SharedRecoveryState
>> is only changed to RECOVERY_STATE_DONE (which makes pg_is_in_recovery() return
>> false) in the latter part of StartupXLOG().
>>
>> So, if there is a delay between FinishWalRecovery() and the state change, the
>> check in wait_for_end_recovery() fails during that window. Since we allow the
>> walreceiver to be missing up to 10 times, checked once per second, the failure
>> occurs if the time lag is longer than 10 seconds.
>>
>> I do not have a good way to fix it. One approach is to make NUM_CONN_ATTEMPTS
>> larger, but that is not a fundamental solution.
>
> I was expecting that slow hosts might have issues in wait_for_end_recovery().
> As you said, there are a lot of steps between FinishWalRecovery() (where the
> walreceiver is shut down -- XLogShutdownWalRcv()) and SharedRecoveryState being
> set to RECOVERY_STATE_DONE. If this window takes longer than NUM_CONN_ATTEMPTS *
> WAIT_INTERVAL (10 seconds), the tool aborts the execution. That's a bad decision
> because the promotion has already finished and the server is just doing the
> final preparation to become a primary.
>
> /*
>  * If it is still in recovery, make sure the target server is
>  * connected to the primary so it can receive the required WAL to
>  * finish the recovery process. If it is disconnected try
>  * NUM_CONN_ATTEMPTS in a row and bail out if not succeed.
>  */
> res = PQexec(conn,
>              "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
> if (PQntuples(res) == 0)
> {
>     if (++count > NUM_CONN_ATTEMPTS)
>     {
>         stop_standby_server(subscriber_dir);
>         pg_log_error("standby server disconnected from the primary");
>         break;
>     }
> }
> else
>     count = 0;    /* reset counter if it connects again */
>
> This code was added to defend against the death/crash of the target server.
> There are at least 3 options:
>
> (1) increase NUM_CONN_ATTEMPTS * WAIT_INTERVAL. We discussed this constant and
> I decided to use 10 seconds because even on some slow hosts this time wasn't
> reached during my tests. It seems I forgot to test the combination of a slow
> host, asserts enabled, and ubsan. I also didn't notice that pg_promote() uses 60
> seconds as its default wait. Maybe that's a reasonable value. I checked the
> 004_timeline_switch test and the last run took: 39.2s (serinus), 33.1s
> (culicidae), 18.31s (calliphoridae) and 27.52s (olingo).
>
> (2) check whether the primary is running when the walreceiver is not available
> on the target server, and increase the connection-attempt count only if the
> primary is not running. Hence, the described case doesn't cause an increment of
> the count variable.
>
> (3) set the recovery_timeout default to a nonzero value and remove the
> pg_stat_wal_receiver check that protects against the death/crash of the target
> server. I explained in a previous message that a timeout may occur in cases
> where WAL replay takes more than recovery-timeout seconds to reach a consistent
> state.
>
> Option (1) is the easiest fix; however, we can hit the same issue again if a
> slow host decides to be even slower, and then we would have to adjust this value
> again. Option (2) interprets the walreceiver's absence as the end of recovery:
> if the primary server is running, it can indicate that the target server is on
> the verge of finishing recovery. Option (3) is not as resilient as the other
> options.
>
> The first patch implements a combination of (1) and (2).
>
>> ## Analysis for failure 2
>>
>> According to [2], the physical replication slot specified as primary_slot_name
>> was not being used by a walsender process. At that point no walsender existed.
>>
>> ```
>> ...
>> pg_createsubscriber: publisher: current wal senders: 0
>> pg_createsubscriber: command is: SELECT 1 FROM pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
>> pg_createsubscriber: error: could not obtain replication slot information: got 0 rows, expected 1 row
>> ...
>> ```
>>
>> Currently the standby must be stopped before the command runs, and the current
>> code does not block until replication has started. So there is a possibility
>> that the check runs before the walsender is launched.
>>
>> One possible approach is to wait until replication starts. An alternative is
>> to relax the condition.
>
> That's my suggestion too. I reused NUM_CONN_ATTEMPTS (that was renamed to
> NUM_ATTEMPTS in the first patch). See second patch.
>

Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
Why isn't recovery_timeout enough to decide if wait_for_end_recovery()
waited long enough?

IMHO the test should simply pass PG_TEST_DEFAULT_TIMEOUT when calling
pg_createsubscriber, and that should do the trick.

Increasing PG_TEST_DEFAULT_TIMEOUT is how buildfarm animals running things
like ubsan/valgrind already deal with exactly this kind of timeout problem.

Or is there a deeper problem with deciding if the system is in recovery?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
