Re: standby promotion can create unreadable WAL

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: standby promotion can create unreadable WAL
Date: 2022-08-24 12:13:36
Message-ID: CA+Tgmoa2v6xr2t-nvHNXGOeHri+EPWZ-w3HU9=S8VjkFGiWYAA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 24, 2022 at 4:40 AM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
> Me, too. There are two ways to deal with this, I think. One is to start
> writing new records from abortedContRecPtr as if it did not
> exist. Another is copying the WAL file up to missingContRecPtr. Since the
> first segment of the new timeline doesn't need to be identical to the
> last one of the previous timeline, I think the former way is
> cleaner.

I agree, mostly because that gets us back to the way all of this
worked before the contrecord stuff went in. This case wasn't broken
then, because the breakage had to do with it being unsafe to back up
and rewrite WAL that might have already been shipped someplace, and
that's not an issue when we're first creating a totally new timeline.
It seems safer to me to go back to the way this worked before the fix
went in than to change over to a new system.

Honestly, in a vacuum, I might prefer to get rid of this thing where
the WAL segment gets copied over from the old timeline to the new, and
just always switch TLIs at segment boundaries. And while we're at it,
I'd also like TLIs to be 64-bit random numbers instead of integers
assigned in ascending order. But those kinds of design changes seem
best left for a future master-only development effort. Here, we need
to back-patch the fix, and should try to just unbreak what's currently
broken.

> XLogInitNewTimeline or somewhere near it seems to be the place for the
> fix to me. Would clearing abortedRecPtr and missingContrecPtr just before
> the call to findNewestTimeLine work?

Hmm, yeah, that seems like a good approach.
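
To make the idea concrete, here is a minimal standalone sketch of
"forget the aborted contrecord state when starting a new timeline". It
is a toy model, not the actual xlog.c code: the typedef and macros are
stand-ins so that it compiles on its own, and the three helper
functions are hypothetical names invented for illustration. Only
abortedRecPtr and missingContrecPtr are names taken from the
discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for PostgreSQL's definitions, so the sketch is self-contained. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)
#define XLogRecPtrIsInvalid(r) ((r) == InvalidXLogRecPtr)

/* Toy model of what recovery remembers about a broken continuation record. */
static XLogRecPtr abortedRecPtr = InvalidXLogRecPtr;
static XLogRecPtr missingContrecPtr = InvalidXLogRecPtr;

/*
 * Hypothetical helper: recovery found a record starting at "aborted" whose
 * continuation, expected at "missing", never arrived.
 */
static void
remember_aborted_contrecord(XLogRecPtr aborted, XLogRecPtr missing)
{
    assert(aborted < missing);
    abortedRecPtr = aborted;
    missingContrecPtr = missing;
}

/*
 * Hypothetical helper: called when promotion is about to create a new
 * timeline.  Nothing on the new timeline has been shipped anywhere yet, so
 * the remembered state is simply dropped.
 */
static void
forget_aborted_contrecord(void)
{
    abortedRecPtr = InvalidXLogRecPtr;
    missingContrecPtr = InvalidXLogRecPtr;
}

/*
 * Hypothetical helper: would end-of-recovery still need the special
 * handling for an aborted continuation record?
 */
static bool
has_aborted_contrecord(void)
{
    return !XLogRecPtrIsInvalid(abortedRecPtr);
}
```

In this model, calling forget_aborted_contrecord() before the new
timeline is chosen means has_aborted_contrecord() returns false
afterwards, so nothing downstream would act on the stale pointers.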

--
Robert Haas
EDB: http://www.enterprisedb.com
