Re: [BUG] non archived WAL removed during production crash recovery

From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: michael(at)paquier(dot)xyz, masao(dot)fujii(at)oss(dot)nttdata(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: [BUG] non archived WAL removed during production crash recovery
Date: 2020-04-21 22:00:22
Message-ID: 20200422000022.7aa6a62b@firost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On Tue, 21 Apr 2020 15:08:17 +0900 (JST)
Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote:

> At Tue, 21 Apr 2020 13:57:39 +0900, Michael Paquier <michael(at)paquier(dot)xyz>
> wrote in
> > On Tue, Apr 21, 2020 at 12:09:25PM +0900, Kyotaro Horiguchi wrote:
> > > + if (!XLogArchivingAlways() &&
> > > + GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
> > >
> > > Is rewritten as
> > >
> > > + if (!XLogArchivingAlways() &&
> > > + GetDBState() > DB_IN_CRASH_RECOVERY)
> > >
> > > FWIW, what annoyed me is there are three variables that are quite
> > > similar but has different domains, ControlFile->state,
> > > XLogCtl->SharedRecoveryState, and LocalRecoveryInProgress. I didn't
> > > mind there were two, but three seems a bit too many to me.

In fact, my original goal [1] was to avoid adding another shared boolean related
to the same topic. So I understand your feeling.

I played with this idea again, based on your argument that there's no problem
if the update to DB_IN_ARCHIVE_RECOVERY reaches checkpointer with some delay.

The other point I feel bad with is to open and check the controlfile again and
again for each segment to archive, even on running production instance.
It's not that it would be heavy, but it feels overkill to fetch this information
that should be available more easily.

That leads me an idea where we would keep the ControlFile data up-to-date in
shared memory. There's a few duplicates between ControlFile and XLogCtl, so
maybe it could make the code a little simpler at some other places than just
fixing $SUBJECT using DBState? I'm not sure of the implications and impacts
though. This seems way bigger than the current fix and with many traps on the
way. Maybe we could discuss this in another thread if you think it deserves it?

Regards,

[1]
https://www.postgresql.org/message-id/flat/20200403182625.0fccc6fd%40firost#28a756094a4b1f3dd24927fb6311927d

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2020-04-21 22:31:19 Re: Bug with memory leak on cert validation in libpq
Previous Message Euler Taveira 2020-04-21 19:47:44 Re: BUG #16354: No geos 3.8.1 package for RHEL 8

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2020-04-21 22:10:54 Re: Do we need to handle orphaned prepared transactions in the server?
Previous Message Alvaro Herrera 2020-04-21 21:57:26 Re: Remove non-fast promotion Re: Should we remove a fallback promotion? take 2