| From: | Ewan Young <kdbase(dot)hack(at)gmail(dot)com> |
|---|---|
| To: | Sergey Tatarintsev <s(dot)tatarintsev(at)postgrespro(dot)ru> |
| Cc: | pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: 004_timeline_switch TAP test may fail |
| Date: | 2026-06-16 10:51:48 |
| Message-ID: | CAON2xHMO2EVsjR+6ab+w6BW5ycyveK5UD5nCik+Uc1b=G6pXiA@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Sergey,
Thanks for the report and patch. I think the analysis is right, and the
fix is in the right place.
The gap traces back to commit 7185eddf, which deliberately dropped the
wait_for_catchup() and switched the primary from teardown_node() to a
clean stop(), on the grounds that a clean stop flushes all WAL to both
standbys before exiting. That's true, but only for standbys whose
walsender is *connected* at shutdown time -- and ->start() only waits
for the postmaster to accept connections, not for the standby's
walreceiver to have connected back to the primary. So if a standby
hasn't connected yet when the primary stops, the clean-shutdown flush
skips it, and we're back to the exact "standbys received different
amounts of WAL -> timeline fork on reconnect" failure that 7185eddf was
meant to fix.
Polling pg_stat_replication until both walsenders are present closes
that hole: it re-establishes the precondition the clean-stop design
silently assumed. And connection is enough here -- the walsender
shutdown path sends all WAL up to the shutdown checkpoint regardless of
catchup state -- so there's no need to additionally check
state = 'streaming'.
One small thing: the rest of this file uses count(*), so I'd write count(*) = 2
rather than count(1) = 2 just for local consistency. And the comment reads a
little better as something like "Wait until both standbys have
connected to the primary",
since by this point they've already started -- what we're waiting for is the
connection.
Regards,
Ewan
On Tue, Jun 16, 2026 at 4:01 PM Sergey Tatarintsev
<s(dot)tatarintsev(at)postgrespro(dot)ru> wrote:
>
> Hi hackers!
>
> I found that after commit 7185eddf0522b3146ed1ff6e063e8e129e77c706 we
> got little omission
> in TAP test 004_timeline_switch:
> ...
> my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
> ...
> $node_primary->stop;
>
> There is no guarantee that standby_1 and standby_2 was successfully
> connected to primary and start
> streaming before primary stopped.
>
> I think we must ensure that primary knows about standby_1 and standby_2
>
> --
> With best regards,
> Sergey Tatarintsev,
> PostgresPro
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Ashutosh Sharma | 2026-06-16 10:52:13 | Re: pg_stat_replication docs incomplete for logical replication |
| Previous Message | Nisha Moond | 2026-06-16 10:44:09 | Re: Support EXCEPT for TABLES IN SCHEMA publications |