Re: 004_timeline_switch TAP test may fail

From: Sergey Tatarintsev <s(dot)tatarintsev(at)postgrespro(dot)ru>
To: Ewan Young <kdbase(dot)hack(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: 004_timeline_switch TAP test may fail
Date: 2026-06-17 00:53:12
Message-ID: f3ce2134-d72b-489b-aba5-f8c012c7708e@postgrespro.ru
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!
Thanks for review!
v2 patch attached
comment changed, count(1) replaced with count(*).

> Hi Sergey,
>
> Thanks for the report and patch. I think the analysis is right, and the
> fix is in the right place.
>
> The gap traces back to commit 7185eddf, which deliberately dropped the
> wait_for_catchup() and switched the primary from teardown_node() to a
> clean stop(), on the grounds that a clean stop flushes all WAL to both
> standbys before exiting. That's true, but only for standbys whose
> walsender is *connected* at shutdown time -- and ->start() only waits
> for the postmaster to accept connections, not for the standby's
> walreceiver to have connected back to the primary. So if a standby
> hasn't connected yet when the primary stops, the clean-shutdown flush
> skips it, and we're back to the exact "standbys received different
> amounts of WAL -> timeline fork on reconnect" failure that 7185eddf was
> meant to fix.
>
> Polling pg_stat_replication until both walsenders are present closes
> that hole: it re-establishes the precondition the clean-stop design
> silently assumed. And connection is enough here -- the walsender
> shutdown path sends all WAL up to the shutdown checkpoint regardless of
> catchup state -- so there's no need to additionally check
> state = 'streaming'.
>
> One small thing: the rest of this file uses count(*), so I'd write count(*) = 2
> rather than count(1) = 2 just for local consistency. And the comment reads a
> little better as something like "Wait until both standbys have
> connected to the primary",
> since by this point they've already started -- what we're waiting for is the
> connection.
>
> Regards,
> Ewan
>
> On Tue, Jun 16, 2026 at 4:01 PM Sergey Tatarintsev
> <s(dot)tatarintsev(at)postgrespro(dot)ru> wrote:
>> Hi hackers!
>>
>> I found that after commit 7185eddf0522b3146ed1ff6e063e8e129e77c706 we
>> got little omission
>> in TAP test 004_timeline_switch:
>> ...
>> my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
>> ...
>> $node_primary->stop;
>>
>> There is no guarantee that standby_1 and standby_2 was successfully
>> connected to primary and start
>> streaming before primary stopped.
>>
>> I think we must ensure that primary knows about standby_1 and standby_2
>>
>> --
>> With best regards,
>> Sergey Tatarintsev,
>> PostgresPro

--
With best regards,
Sergey Tatarintsev,
PostgresPro

Attachment Content-Type Size
v2-0001-Fix-004_timeline_switch-TAP-test-wait-for-standbys-s.patch text/x-patch 1.0 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2026-06-17 01:17:14 Re: IGNORE/RESPECT NULLS can be specified for (prokind == 'f').
Previous Message jian he 2026-06-17 00:34:22 Re: IGNORE/RESPECT NULLS can be specified for (prokind == 'f').