Re: pg_rewind WAL segments deletion pitfall

From: torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>
To: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: bungina(at)gmail(dot)com, cyberdemn(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Subject: Re: pg_rewind WAL segments deletion pitfall
Date: 2023-08-18 06:40:57
Message-ID: 8b385bb6d5f87e54c1c6333fece0444a@oss.nttdata.com
Lists: pgsql-bugs pgsql-hackers

On 2022-09-29 17:18, Polina Bungina wrote:
> I agree with your suggestions, so here is the updated version of
> patch. Hope I haven't missed anything.

Thanks for the patch, I've marked this as ready-for-committer.

BTW, this issue can be considered a bug, right?
I think it would be appropriate to back-patch the fix.

On 2023-06-29 18:42, torikoshia wrote:
> On 2023-06-29 10:25, Kyotaro Horiguchi wrote:
> Thanks for the comment!
>
>> At Wed, 28 Jun 2023 22:28:13 +0900, torikoshia
>> <torikoshia(at)oss(dot)nttdata(dot)com> wrote in
>>>
>>> On 2022-09-29 17:18, Polina Bungina wrote:
>>> > I agree with your suggestions, so here is the updated version of
>>> > patch. Hope I haven't missed anything.
>>> > Regards,
>>> > Polina Bungina
>>>
>>> Thanks for working on this!
>>> It seems like we are also facing the same issue.
>>
>> Thanks for looking at this.
>>
>>> I tested the v3 patch under our conditions, and the old primary
>>> succeeded in becoming the new standby.
>>>
>>>
>>> BTW, when I used pg_rewind-removes-wal-segments-reproduce.sh attached
>>> in [1], the old primary also failed to become a standby:
>>>
>>> FATAL: could not receive data from WAL stream: ERROR: requested WAL
>>> segment 000000020000000000000007 has already been removed
>>>
>>> However, I think this is not a problem: just adding restore_command
>>> like below fixed the situation.
>>>
>>> echo "restore_command = '/bin/cp `pwd`/newarch/%f %p'" >>
>>> oldprim/postgresql.conf
>>
>> I thought along the same lines at first, but that's not the point
>> here.
>
> Yes. I don't think adding restore_command solves the problem; a
> modification that prevents deleting necessary WAL, like the proposed
> patch, is necessary.
>
> I added restore_command because
> pg_rewind-removes-wal-segments-reproduce.sh failed to catch up
> even after applying the v3 patch and preventing pg_rewind from deleting
> WALs(*), because some necessary WALs had been archived.
>
> It's not the problem we are discussing here, but I wanted to get
> the script to work to the point where the old primary could
> successfully catch up to the new primary.
>
> (*) Specifically, when running the script without applying the patch,
> recovery failed because 000000010000000000000003 had
> already been removed. This file was deleted by pg_rewind, as
> we know.
> OTOH, without the restore_command, recovery failed because
> 000000020000000000000007 had already been removed, even after
> applying the patch.
>
>> The problem we want to address is that pg_rewind ultimately
>> removes certain crucial WAL files required for the new primary to
>> start, despite them being present previously.
>
> I thought it's not the "new primary", but the "old primary".
>
>> In other words, that
>> restore_command works, but it only undoes what pg_rewind wrongly did,
>> resulting in unnecessary consumption of I/O and/or network bandwidth
>> that essentially serves no purpose.
>
> As far as I tested using the script and the situation we are facing,
> after promoting newprim the necessary WAL (000000010000000000000003..) was
> not available, and just adding restore_command did not solve the
> problem.
>
>> pg_rewind already has a feature that determines how each file should
>> be handled, but it is currently making wrong decisions for WAL
>> files. The goal here is to rectify this behavior and ensure that
>> pg_rewind makes the right decisions.
>
> +1
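
To make the "which WAL files must survive" question concrete, here is a
rough sketch in Python. It is not the patch's code and not how pg_rewind
is implemented. It assumes the default 16MB wal_segment_size, ignores
timelines when comparing, and assumes that everything at or after some
given starting segment has to be kept. The names parse_wal_name and
segments_to_keep are made up for illustration.

```python
# Assumption: default wal_segment_size of 16MB.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_wal_name(name, wal_segment_size=WAL_SEGMENT_SIZE):
    """Split a 24-hex-digit WAL file name into (timeline, segment number)."""
    if len(name) != 24:
        raise ValueError("not a WAL segment file name: %r" % name)
    tli = int(name[0:8], 16)    # timeline ID
    log = int(name[8:16], 16)   # high part of the segment number
    seg = int(name[16:24], 16)  # low part of the segment number
    segments_per_xlogid = 0x100000000 // wal_segment_size
    return tli, log * segments_per_xlogid + seg


def segments_to_keep(names, first_needed):
    """Return the names at or after first_needed (hypothetical cutoff).

    Simplification: only segment numbers are compared, not timelines.
    """
    _, first_segno = parse_wal_name(first_needed)
    return sorted(n for n in names if parse_wal_name(n)[1] >= first_segno)


if __name__ == "__main__":
    on_disk = [
        "000000010000000000000002",
        "000000010000000000000003",
        "000000010000000000000004",
    ]
    # Segment ...03 is the one the unpatched run reported as removed.
    print(segments_to_keep(on_disk, "000000010000000000000003"))
```

With the example list, only 000000010000000000000002 falls before the
cutoff; the other two segments would be kept.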

Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
