RE: Stronger safeguard for archive recovery not to miss data

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: 'Kyotaro Horiguchi' <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Stronger safeguard for archive recovery not to miss data
Date: 2020-12-08 03:08:20
Message-ID: OSBPR01MB48889E884ABE377AE505F543EDCD0@OSBPR01MB4888.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi

On Thursday, November 26, 2020 4:29 PM
Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> At Thu, 26 Nov 2020 07:18:39 +0000, "osumi(dot)takamichi(at)fujitsu(dot)com"
> <osumi(dot)takamichi(at)fujitsu(dot)com> wrote in
> > The attached patch is intended to prevent a scenario that archive
> > recovery hits WALs which come from wal_level=minimal and the server
> > continues to work, which was discussed in the thread of [1].
> > The motivation is to protect that user ends up with both getting
> > replica that could miss data and getting the server to miss data in targeted
> recovery mode.
> >
> > About how to modify this, we reached the consensus in the thread.
> > It is by changing the ereport's level from WARNING to FATAL in
> CheckRequiredParameterValues().
> >
> > In order to test this fix, what I did is
> > 1 - get a base backup during wal_level is replica
> > 2 - stop the server and change the wal_level from replica to minimal
> > 3 - restart the server(to generate XLOG_PARAMETER_CHANGE)
> > 4 - stop the server and make the wal_level back to replica
> > 5 - start the server again
> > 6 - execute archive recoveries in both cases
> > (1) by editing the postgresql.conf and
> > touching recovery.signal in the base backup from 1th step
> > (2) by making a replica with standby.signal
> > * During wal_level is replica, I enabled archive_mode in this test.
> >
> > First of all, I confirmed the server started up without this patch.
> > After applying this safeguard patch, I checked that the server cannot
> > start up any more in the scenario case.
> > I checked the log and got the result below with this patch.
> >
> > 2020-11-26 06:49:46.003 UTC [19715] FATAL: WAL was generated with
> > wal_level=minimal, data may be missing
> > 2020-11-26 06:49:46.003 UTC [19715] HINT: This happens if you
> temporarily set wal_level=minimal without taking a new base backup.
> >
> > Lastly, this should be backpatched.
> > Any comments ?
>
> Perhaps we need the TAP test that conducts the above steps.
I added the TAP tests to reproduce and share the result,
using the case of 6-(1) described above.
Here, I created a new file for it because the purposes of other files in
src/recovery didn't match the purpose of my TAP tests perfectly.
If you are dubious about this idea, please have a look at the comments
in each file.

When the attached patch is applied,
my TAP tests are executed like other ones like below.

t/018_wal_optimize.pl ................ ok
t/019_replslot_limit.pl .............. ok
t/020_archive_status.pl .............. ok
t/021_row_visibility.pl .............. ok
t/022_archive_recovery.pl ............ ok
All tests successful.

Also, I confirmed that there's no regression by make check-world.
Any comments ?

Best,
Takamichi Osumi

Attachment Content-Type Size
stronger_safeguard_for_archive_recovery_v02.patch application/octet-stream 3.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message tsunakawa.takay@fujitsu.com 2020-12-08 03:09:50 RE: [bug fix] ALTER TABLE SET LOGGED/UNLOGGED on a partitioned table does nothing silently
Previous Message Kyotaro Horiguchi 2020-12-08 03:04:04 Re: Huge memory consumption on partitioned table with FKs