Re: A failure in t/038_save_logical_slots_shutdown.pl

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: A failure in t/038_save_logical_slots_shutdown.pl
Date: 2024-01-13 11:12:55
Message-ID: CAA4eK1+F5P9CEVNquD-fFd0-sxEkOPBGigM5iEcs8d8Yv+2HUg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jan 12, 2024 at 3:36 PM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> On Fri, Jan 12, 2024 at 9:28 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Thu, Jan 11, 2024 at 10:03 PM Bharath Rupireddy
> > <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Jan 11, 2024 at 4:35 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > > >
> > > > On further analysis, it was found that in the failing test,
> > > > CHECKPOINT_SHUTDOWN was started in a new page, so there was the WAL
> > > > page header present just before the CHECKPOINT_SHUTDOWN which was
> > > > causing the failure. We could alternatively reproduce the issue by
> > > > switching the WAL file before restarting the server like in the
> > > > attached test change patch.
> > > > There are a couple of ways to fix this issue a) one by switching the
> > > > WAL before the insertion of records so that the CHECKPOINT_SHUTDOWN
> > > > does not get inserted in a new page as in the attached test_fix.patch
> > > > b) by using pg_walinspect to check that the next WAL record is
> > > > CHECKPOINT_SHUTDOWN. I have to try this approach.
> > > >
> > > > Thanks to Bharath and Kuroda-san for offline discussions and helping
> > > > in getting to the root cause.
> > >
> > > IIUC, the problem the commit e0b2eed tries to solve is to ensure there
> > > are no left-over decodable WAL records between confirmed_flush LSN and
> > > a shutdown checkpoint, which is what it is expected from the
> > > t/038_save_logical_slots_shutdown.pl. How about we have a PG function
> > > returning true if there are any decodable WAL records between the
> > > given start_lsn and end_lsn?
> > >
> >
> > But, we already test this in 003_logical_slot during a successful
> > upgrade. Having an explicit test to do the same thing has some merits
> > but not sure if it is worth it.
>
> If the code added by commit e0b2eed is covered by the new upgrade
> test, why not remove 038_save_logical_slots_shutdown.pl altogether?
>

This is a more strict check because it is possible that even if the
latest confirmed_flush location is not persisted there is no
meaningful decodable WAL between whatever the last confirmed_flush
location saved on disk and the shutdown_checkpoint record.
Kuroda-San/Vignesh, do you have any suggestion on this one?

> > The current test tries to ensure that
> > during shutdown after we shutdown walsender and ensures that it sends
> > all the wal records and receipts an ack for the same, there is no
> > other WAL except shutdown_checkpoint. Vignesh's suggestion (a) makes
> > the test robust enough that it shouldn't show spurious failures like
> > the current one reported by you, so shall we try to proceed with that?
>
> Do you mean something like [1]? It ensures the test passes unless any
> writes are added (in future) before the publisher restarts in the test
> which can again make the tests flaky. How do we ensure no one adds
> anything in before the publisher restart
> 038_save_logical_slots_shutdown.pl? A note before the restart perhaps?
>

I am fine with adding the note.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-01-13 11:21:00 Re: Invalidate the subscription worker in cases where a user loses their superuser status
Previous Message Amit Kapila 2024-01-13 11:02:47 Re: speed up a logical replica setup