Re: pgsql: Fix the intermittent buildfarm failures in 040_standby_failover_

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <akapila(at)postgresql(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pgsql: Fix the intermittent buildfarm failures in 040_standby_failover_
Date: 2024-04-09 02:07:45
Message-ID: CAA4eK1KgU7=b2eXTw_X5VNfQ-oW-AMkz52q9_aQThXFuCkUcQA@mail.gmail.com
Lists: pgsql-committers pgsql-hackers

On Mon, Apr 8, 2024 at 9:24 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Hi,
>
> I'm concerned that the failover slots feature may not be in
> sufficiently good shape for us to ship it. Since this test file was
> introduced at the end of January, it's been touched by a total of 16
> commits, most of which seem to be trying to get it to pass reliably:
>

Among the 16 commits, there are 6 feature commits (3 of which belong
to another feature that interacts with this one), 1 code improvement
commit, 2 bug-fix commits, and 7 test stabilization commits. See [1]
for the categorization of commits. Now, among these 7 test
stabilization commits (which seem to be the main source of your
concern), 4 exist because the tests expect the slots to be synced in
a single call to pg_sync_replication_slots(), which sometimes didn't
happen when there was unexpected WAL generation, say by bgwriter
(XLOG_RUNNING_XACTS), or an extra XID generated by autoanalyze. This
shouldn't be a problem in practice, where users are expected to use
the slotsync worker, which keeps syncing slots at regular intervals.
All of these stabilizations are in two of the tests that use the
function pg_sync_replication_slots() to sync slots. We could think of
getting rid of this function and relying only on the slotsync worker,
but I find this function quite convenient for debugging and, in some
cases, for writing targeted tests (though it caused instability in
tests). We can provide more information in the docs on the use of
this function.
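
To illustrate why a single call can fall short while repeated calls
converge, here is a minimal sketch of a retry loop around the sync
function. It is not code from the tests or from PostgreSQL. The helper
name sync_until_persisted, the two callables, and the retry parameters
are hypothetical, and the SQL text assumes the synced and temporary
columns of pg_replication_slots behave as described in the PostgreSQL
17 documentation.

```python
import time
from typing import Callable

# SQL a caller might run on the standby (assumption: PostgreSQL 17 catalog).
SYNC_SQL = "SELECT pg_sync_replication_slots()"
# The placeholder style (%s) depends on the database driver in use.
CHECK_SQL = (
    "SELECT count(*) = 1 FROM pg_replication_slots "
    "WHERE slot_name = %s AND synced AND NOT temporary"
)


def sync_until_persisted(
    run_sync: Callable[[], None],
    is_persisted: Callable[[], bool],
    attempts: int = 5,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call the sync step repeatedly until the slot is reported persisted.

    run_sync is expected to execute SYNC_SQL and is_persisted to execute
    CHECK_SQL; both are injected so the loop itself needs no database.
    Returns True as soon as is_persisted() is true, else False once the
    attempts are exhausted.
    """
    for attempt in range(attempts):
        run_sync()
        if is_persisted():
            return True
        # Wait before the next attempt, but not after the last one.
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

This mirrors, in miniature, what the slotsync worker gives users for
free: one attempt may be defeated by concurrent WAL or XID activity,
but periodic attempts are expected to catch up.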

The other stabilization fixes are as follows: 1 fixes a Perl
scripting issue in the check of the LOGs, 1 raises the log level to
DEBUG2 to capture more information on failures, and 1 adds a test
setup step that was missed here but is already done in other similar
tests.

Having said that, I have kept an eye on the reports (-hackers, -bugs,
etc.) related to this feature and if we find that this feature is
inconvenient to use then we should consider either improving it, if
possible, or reverting it.

[1]:
New features:
6f132ed693 Allow synced slots to have their inactive_since.
6ae701b437 Track invalidation_reason in pg_replication_slots.
bf279ddd1c Introduce a new GUC 'standby_slot_names'.
93db6cbda0 Add a new slot sync worker to synchronize logical slots.
ddd5f4f54a Add a slot synchronization function.
776621a5e4 Add a failover option to subscriptions.

Code improvement:
801792e528 Improve ERROR/LOG messages added by commits ddd5f4f54a and
7a424ece48.

Bug fixes:
2ec005b4e2 Ensure that the sync slots reach a consistent state after
promotion without losing data.
b3f6b14cf4 Fixups for commit 93db6cbda0.

Stabilize test cases:
def0ce3370 Fix BF failure introduced by commit b3f6b14cf4.
b7bdade6a4 Disable autovacuum on primary in
040_standby_failover_slots_sync test.
d9e225f275 Change the LOG level in 040_standby_failover_slots_sync.pl to DEBUG2.
9bc1eee988 Another try to fix BF failure introduced in commit ddd5f4f54a.
bd8fc1677b Fix BF introduced in commit ddd5f4f54a.
d13ff82319 Fix BF failure in commit 93db6cbda0.
6f3d8d5e7c Fix the intermittent buildfarm failures in
040_standby_failover_slots_sync.

--
With Regards,
Amit Kapila.
