Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: michael(at)paquier(dot)xyz
Cc: simseih(at)amazon(dot)com, alvherre(at)alvh(dot)no-ip(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Date: 2022-06-28 00:46:27
Message-ID: 20220628.094627.1229111489487982500.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Mon, 27 Jun 2022 15:02:11 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in
> On Fri, Jun 24, 2022 at 04:17:34PM +0000, Imseih (AWS), Sami wrote:
> > It is been difficult to get a generic repro, but the way we reproduce
> > Is through our test suite. To give more details, we are running tests
> > In which we constantly failover and promote standbys. The issue
> > surfaces after we have gone through a few promotions which occur
> > every few hours or so ( not really important but to give context ).
>
> Hmm. Could you describe exactly the failover scenario you are using?
> Is the test using a set of cascading standbys linked to the promoted
> one? Are the standbys recycled from the promoted nodes with pg_rewind
> or created from scratch with a new base backup taken from the
> freshly-promoted primary? I have been looking more at this thread
> through the day but I don't see a remaining issue. It could be
> perfectly possible that we are missing a piece related to the handling
> of those new overwrite contrecords in some cases, like in a rewind.
>
> > I am adding some additional debugging to see if I can draw a better
> > picture of what is happening. Will also give aborted_contrec_reset_3.patch
> > a go, although I suspect it will not handle the specific case we are dealing with.
>
> Yeah, this is not going to change much things if you are still seeing
> an issue. This patch does not change the logic, aka it just

True. That is a significant hint about what happened at the time.

- Are there only two hosts in the replication set? I am asking because
I want to know whether it is a cascading setup or not.

- What exactly do you do at each failover? In particular, do the steps
include pg_rewind, and do you copy pg_wal and/or archive files
between the failover hosts?

> simplifies the tracking of the continuation record data, resetting it
> when a complete record has been read. Saying that, getting rid of the
> dependency on StandbyMode because we cannot promote in the middle of a
> record is nice (my memories around that were a bit blurry but even
> recovery_target_lsn would not recover in the middle of an continuation
> record), and this is not bug so there is limited reason to backpatch
> this part of the change.

Agreed. In the first place, my "repro" (or the test case) is a bit too
intricate to happen in the field.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
