Re: pg_rewind with cascade standby doesn't work well

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Kuwamura Masaki <kuwamura(at)db(dot)is(dot)i(dot)nagoya-u(dot)ac(dot)jp>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Subject: Re: pg_rewind with cascade standby doesn't work well
Date: 2023-09-26 23:33:03
Message-ID: ZRNqL8pxKnidU8wq@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Sep 26, 2023 at 06:44:50PM +0300, Aleksander Alekseev wrote:
>> And also, I'm afraid that I'm not sure what kind of tests I have to make
>> for fix this behavior. Would you mind giving me some advice?
>
> Personally I would prefer not to increase the scope of work. Your TAP
> test added in 0001 seems to be adequate.

Yeah, agreed. I'm OK with what you are proposing, basically (the
test could be made a bit cheaper actually).

/*
- * For Hot Standby, we could treat this like a Shutdown Checkpoint,
- * but this case is rarer and harder to test, so the benefit doesn't
- * outweigh the potential extra cost of maintenance.
+ * For Hot Standby, we could treat this like an end-of-recovery checkpoint
*/
+ RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);

I don't understand what you want to change here. Archive recovery and
crash recovery are two different things, still this code would be
triggered even if there is no standby.signal, aka the node is not a
standby. Why isn't this stuff conditional?

>> BTW, I was able to
>> reproduce the assertion failure Kuwamura-san reported, even after applying
>> your latest patch from the thread.
>
> Do you mean that the test fails or it doesn't but there are other
> steps to reproduce the issue?

I get it as Fujii-san testing the patch from [1], still failing the
test from [2]:
[1]: https://www.postgresql.org/message-id/ZArVOMifjzE7f8W7@paquier.xyz
[2]: https://www.postgresql.org/message-id/CAMyC8qryE7mKyvPvGHCt5GpANAmp8sS_tRbraqXcPBx14viy6g@mail.gmail.com

I would be surprised, actually, because the patch from [1] would cause
step 7 of the test to fail: the patch causes standby.signal or
recovery.signal to be required. Anyway, this specific issue, if any,
had better be discussed on the other thread. I need to address a few
comments there as well and was planning to get back to it. It is
possible that I've missed something on the other thread with the
restrictions I was proposing in the latest version of the patch.

For this thread, let's focus on the pg_rewind case and how we want to
treat these records to improve the cascading case.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2023-09-26 23:36:31 Re: [PoC/RFC] Multiple passwords, interval expirations
Previous Message Michael Paquier 2023-09-26 23:14:15 Re: Fail hard if xlogreader.c fails on out-of-memory