Re: pgsql: Improve runtime and output of tests for replication slots checkp

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-25 16:52:02
Message-ID: CALDaNm1avMmyUCryYHYRjXDFXaxg7J-EknmqXkLXbPiLW9Fi3A@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-committers

On Tue, 24 Jun 2025 at 00:20, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>
> On Mon, Jun 23, 2025 at 4:33 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > On Mon, Jun 23, 2025 at 6:01 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Jun 23, 2025 at 3:00 PM Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> wrote:
> > > > On Mon, 23 Jun 2025 at 12:24, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> > > > > On Mon, Jun 23, 2025 at 3:29 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > > > > > > Yeah, that's what I think too. The unintentional omission of a
> > > > > > > pre-shutdown delay in the 046 test has exposed some pre-existing
> > > > > > > fragility in pg_logical_slot_get_changes(). So I'm not in favor
> > > > > > > of changing 046 till we understand this better.
> > > > > >
> > > > > > Yes, better to understand what's going on before changing the test,
> > > > > > and perhaps change 046 so as it provide stable coverage for this case,
> > > > > > even if discovered accidentally.
> > > > > >
> > > > > > So, is it only pg_logical_slot_get_changes_guts() that's messed up in
> > > > > > this context, because XLogDecodeNextRecord() is trying to read pages
> > > > > > as it has a logic too permissive? How did any of you reproduce the
> > > > > > failure? Just by running the test in a loop? It is one of these
> > > > > > patterns where a hardcoded sleep should do the trick and help with a
> > > > > > bisect.
> > > > >
> > > > > Personally I just run the test in the loop. It takes about ~30 times to reproduce. Works both with -O0 and -O2.
> > > > >
> > > > > cd src/test/recovery
> > > > > while PROVE_TESTS="t/046_checkpoint_logical_slot.pl t/047_checkpoint_physical_slot.pl" make check; do :; done
> > > > >
> > > > > > By the way, At the bottom of test 046_checkpoint_logical_slot.pl, if
> > > > > > you expect the checkpoint to complete before sending the immediate
> > > > > > shutdown request, I would suggest to use a wait_for_log() based on
> > > > > > "checkpoint complete" or an equivalent loop. What you are doing in
> > > > > > the test is unstable as written.
> > > > >
> > > > > Exactly. I've proposed the fix with wait_for_log() in [1]. Nevertheless, both cases (immediate stop before checkpoint completion, and immediate stop after checkpoint completion) must work without hang.
> > > >
> > > > CFBot has been much more red than usual since this change afaict. Many
> > > > more windows builds are failing than usual with an error like this:
> > > >
> > > > [08:28:52.683] 338/338 postgresql:recovery /
> > > > recovery/046_checkpoint_logical_slot TIMEOUT 1000.09s exit status 1
> > > >
> > > > How about we revert the commit for now to get CI and the buildfarm green again?
> > >
> > > What about removing the 046_checkpoint_logical_slot which currently
> > > causes all the buzz? I'm not yet convinced we need to revert
> > > ca307d5cec90. Opinions?
> > >
> >
> > If we decide to revert/remove anything, it is better to remove
> > 046_checkpoint_logical_slot and keep investigating the issue. As per
> > the information available at this point, it appears to be a base code
> > bug accidentally discovered by this test case. OTOH, removing this
> > test has a risk that there could be a delay in finding the root cause
> > of the issue.
>
> I decided to remove the test while we're investigating the issue. It
> might take a bit longer for us to fix, but that wouldn't distort
> others' work.

I have analyzed this issue further and started a new thread for this at [1].
[1] - https://www.postgresql.org/message-id/CALDaNm34m36PDHzsU_GdcNXU0gLTfFY5rzh9GSQv%3Dw6B%2BQVNRQ%40mail.gmail.com

Regards,
Vignesh

In response to

Browse pgsql-committers by date

  From Date Subject
Next Message Richard Guo 2025-06-26 03:18:32 pgsql: Expand virtual generated columns for ALTER COLUMN TYPE
Previous Message Tomas Vondra 2025-06-25 12:53:41 Re: pgsql: Introduce pg_shmem_allocations_numa view