Re: pgsql: Fix the intermittent buildfarm failures in 040_standby_failover_

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <akapila(at)postgresql(dot)org>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pgsql: Fix the intermittent buildfarm failures in 040_standby_failover_
Date: 2024-04-08 15:53:48
Message-ID: CA+TgmoaA4oufUBR5B-4o83rnwGZ3zAA5UvwxDX=NjCm1TVgRsQ@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Mon, Apr 8, 2024 at 4:04 AM Amit Kapila <akapila(at)postgresql(dot)org> wrote:
> Fix the intermittent buildfarm failures in 040_standby_failover_slots_sync.
>
> It is possible that even if the primary waits for the subscriber to catch
> up and then disables the subscription, the XLOG_RUNNING_XACTS record gets
> inserted between the two steps by bgwriter and walsender processes it.
> This can move the restart_lsn of the corresponding slot in an
> unpredictable way which further leads to slot sync failure.
>
> To ensure predictable behaviour, we drop the subscription and manually
> create the slot before the test. The other idea we discussed to write a
> predictable test is to use injection points to control the bgwriter
> logging XLOG_RUNNING_XACTS but that needs more analysis. We can add a
> separate test using injection points.

Hi,

I'm concerned that the failover slots feature may not be in
sufficiently good shape for us to ship it. Since this test file was
introduced at the end of January, it's been touched by a total of 16
commits, most of which seem to be trying to get it to pass reliably:

6f3d8d5e7c Fix the intermittent buildfarm failures in
040_standby_failover_slots_sync.
6f132ed693 Allow synced slots to have their inactive_since.
2ec005b4e2 Ensure that the sync slots reach a consistent state after
promotion without losing data.
6ae701b437 Track invalidation_reason in pg_replication_slots.
bf279ddd1c Introduce a new GUC 'standby_slot_names'.
def0ce3370 Fix BF failure introduced by commit b3f6b14cf4.
b3f6b14cf4 Fixups for commit 93db6cbda0.
d13ff82319 Fix BF failure in commit 93db6cbda0.
93db6cbda0 Add a new slot sync worker to synchronize logical slots.
801792e528 Improve ERROR/LOG messages added by commits ddd5f4f54a and
7a424ece48.
b7bdade6a4 Disable autovacuum on primary in
040_standby_failover_slots_sync test.
d9e225f275 Change the LOG level in 040_standby_failover_slots_sync.pl to DEBUG2.
9bc1eee988 Another try to fix BF failure introduced in commit ddd5f4f54a.
bd8fc1677b Fix BF introduced in commit ddd5f4f54a.
ddd5f4f54a Add a slot synchronization function.
776621a5e4 Add a failover option to subscriptions.

It's not really the test failures themselves that concern me here, so
much as the possibility of users who try to make use of this feature
having a similar amount of difficulty getting it to work reliably. The
test case seems to be taking more and more elaborate precautions to
prevent incidental things from breaking the feature. But users won't
like this feature very much if they have to take elaborate precautions
to get it to work in the real world. Is there a reason to believe that
all of this stabilization work is addressing artificial problems that
won't inconvenience real users, or should we be concerned that the
feature itself is going to be difficult to use effectively?

--
Robert Haas
EDB: http://www.enterprisedb.com
