Re: 004_timeline_switch TAP test may fail

From: Ewan Young <kdbase(dot)hack(at)gmail(dot)com>
To: Sergey Tatarintsev <s(dot)tatarintsev(at)postgrespro(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: 004_timeline_switch TAP test may fail
Date: 2026-06-16 10:51:48
Message-ID: CAON2xHMO2EVsjR+6ab+w6BW5ycyveK5UD5nCik+Uc1b=G6pXiA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Sergey,

Thanks for the report and patch. I think the analysis is right, and the
fix is in the right place.

The gap traces back to commit 7185eddf, which deliberately dropped the
wait_for_catchup() and switched the primary from teardown_node() to a
clean stop(), on the grounds that a clean stop flushes all WAL to both
standbys before exiting. That's true, but only for standbys whose
walsender is *connected* at shutdown time -- and ->start() only waits
for the postmaster to accept connections, not for the standby's
walreceiver to have connected back to the primary. So if a standby
hasn't connected yet when the primary stops, the clean-shutdown flush
skips it, and we're back to the exact "standbys received different
amounts of WAL -> timeline fork on reconnect" failure that 7185eddf was
meant to fix.

Polling pg_stat_replication until both walsenders are present closes
that hole: it re-establishes the precondition the clean-stop design
silently assumed. And connection is enough here -- the walsender
shutdown path sends all WAL up to the shutdown checkpoint regardless of
catchup state -- so there's no need to additionally check
state = 'streaming'.

One small thing: the rest of this file uses count(*), so I'd write count(*) = 2
rather than count(1) = 2 just for local consistency. And the comment reads a
little better as something like "Wait until both standbys have
connected to the primary",
since by this point they've already started -- what we're waiting for is the
connection.

Regards,
Ewan

On Tue, Jun 16, 2026 at 4:01 PM Sergey Tatarintsev
<s(dot)tatarintsev(at)postgrespro(dot)ru> wrote:
>
> Hi hackers!
>
> I found that after commit 7185eddf0522b3146ed1ff6e063e8e129e77c706 we
> got little omission
> in TAP test 004_timeline_switch:
> ...
> my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
> ...
> $node_primary->stop;
>
> There is no guarantee that standby_1 and standby_2 was successfully
> connected to primary and start
> streaming before primary stopped.
>
> I think we must ensure that primary knows about standby_1 and standby_2
>
> --
> With best regards,
> Sergey Tatarintsev,
> PostgresPro

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Sharma 2026-06-16 10:52:13 Re: pg_stat_replication docs incomplete for logical replication
Previous Message Nisha Moond 2026-06-16 10:44:09 Re: Support EXCEPT for TABLES IN SCHEMA publications