Re: pgsql: Improve runtime and output of tests for replication slots checkp

From: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-23 11:59:47
Message-ID: CAGECzQTx+UHMfVPgKRL0KqyeKgRoXFAp2LJbcDRTZOhduM-NKg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-committers

On Mon, 23 Jun 2025 at 12:24, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>
> On Mon, Jun 23, 2025 at 3:29 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > > Yeah, that's what I think too. The unintentional omission of a
> > > pre-shutdown delay in the 046 test has exposed some pre-existing
> > > fragility in pg_logical_slot_get_changes(). So I'm not in favor
> > > of changing 046 till we understand this better.
> >
> > Yes, better to understand what's going on before changing the test,
> > and perhaps change 046 so as it provide stable coverage for this case,
> > even if discovered accidentally.
> >
> > So, is it only pg_logical_slot_get_changes_guts() that's messed up in
> > this context, because XLogDecodeNextRecord() is trying to read pages
> > as it has a logic too permissive? How did any of you reproduce the
> > failure? Just by running the test in a loop? It is one of these
> > patterns where a hardcoded sleep should do the trick and help with a
> > bisect.
>
> Personally I just run the test in the loop. It takes about ~30 times to reproduce. Works both with -O0 and -O2.
>
> cd src/test/recovery
> while PROVE_TESTS="t/046_checkpoint_logical_slot.pl t/047_checkpoint_physical_slot.pl" make check; do :; done
>
> > By the way, At the bottom of test 046_checkpoint_logical_slot.pl, if
> > you expect the checkpoint to complete before sending the immediate
> > shutdown request, I would suggest to use a wait_for_log() based on
> > "checkpoint complete" or an equivalent loop. What you are doing in
> > the test is unstable as written.
>
> Exactly. I've proposed the fix with wait_for_log() in [1]. Nevertheless, both cases (immediate stop before checkpoint completion, and immediate stop after checkpoint completion) must work without hang.

CFBot has been much more red than usual since this change afaict. Many
more windows builds are failing than usual with an error like this:

[08:28:52.683] 338/338 postgresql:recovery /
recovery/046_checkpoint_logical_slot TIMEOUT 1000.09s exit status 1

How about we revert the commit for now to get CI and the buildfarm green again?

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Alexander Korotkov 2025-06-23 12:31:25 Re: pgsql: Improve runtime and output of tests for replication slots checkp
Previous Message John Naylor 2025-06-23 11:05:59 pgsql: Properly fix AVX-512 CRC calculation bug