RE: Stronger safeguard for archive recovery not to miss data

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: 'Fujii Masao' <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, 'Kyotaro Horiguchi' <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "david(at)pgmasters(dot)net" <david(at)pgmasters(dot)net>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "laurenz(dot)albe(at)cybertec(dot)at" <laurenz(dot)albe(at)cybertec(dot)at>
Subject: RE: Stronger safeguard for archive recovery not to miss data
Date: 2021-04-05 14:49:04
Message-ID: OSBPR01MB4888CF9DB311C719FFEA2632ED779@OSBPR01MB4888.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon Apr 5, 2021 12:35 PM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> On 2021/04/04 11:58, osumi(dot)takamichi(at)fujitsu(dot)com wrote:
> >> IMO it's better to comment why this server restart is necessary.
> >> As far as I understand correctly, this is necessary to ensure the WAL
> >> file containing the record about the change of wal_level (to minimal)
> >> is archived, so that the subsequent archive recovery will be able to replay
> it.
> > OK, added some comments. Further, I felt the way I wrote this part was
> > not good at all and self-evident and developers who read this test would
> feel uneasy about that point.
> > So, a little bit fixed that test so that we can get clearer conviction for wal
> archive.
>
> LGTM. Thanks for updating the patch!
>
> Attached is the updated version of the patch. I applied the following changes.
> Could you review this version? Barring any objection, I'm thinking to commit
> this.
>
> +sub check_wal_level
> +{
> + my ($target_wal_level, $explanation) = @_;
> +
> + is( $node->safe_psql(
> + 'postgres', q{show wal_level}),
> + $target_wal_level,
> + $explanation);
>
> Do we really need this test? This test doesn't seem to be directly related to
> the test purpose. It seems to be testing the behavior of the PostgresNode
> functions. So I removed this from the patch.
Yeah, at the phase to commit, we don't need those.
Let's make only essential parts left.

> +# This protection should apply to recovery mode my $another_node =
> +get_new_node('another_node');
>
> The same test is performed twice with different settings. So isn't it better to
> merge the code for both tests into one common function, to simplify the
> code? I did that.
>
> I also applied some cosmetic changes.
Thank you so much. Your comments are by far more precise.
Further, the refactoring of the two tests by the common function looks really good.

Some minor comments.

(1)

+ # Confirm that the archive recovery fails with an error
+ my $logfile = slurp_file($recovery_node->logfile());
+ ok( $logfile =~
+ qr/FATAL: WAL was generated with wal_level=minimal, cannot continue recovering/,
+ "$node_text ends with an error because it finds WAL generated with wal_level=minimal");

How about a comment
"Confirm that the archive recovery fails with an expected error" ?

(2)

+# Test for archive recovery
+test_recovery_wal_level_minimal('archive_recovery', 'archive recovery', 0);
We have two spaces in one comment. Should be fixed.

(3)

I thought the function name 'test_recovery_wal_level_minimal'
is a little bit weird and can be changed.
How about something like execute_recovery_scenario,
test_recovery_safeguard or test_archive_recovery_safeguard ?

> >>> By the way, when I build postgres with this patch and
> >>> enable-coverage option, the results of RT becomes unstable. Does
> >>> someone know the
> >> reason ?
> >>> When it fails, I get stderr like below
> >>
> >> I have no idea about this. Does this happen even without the patch?
> > Unfortunately, no. I get this only with --enable-coverage and with my
> > patch, althought regression tests have passed with this patch.
> > OSS HEAD doesn't produce the stderr even with --enable-coverage.
>
> Could you check whether the latest patch still causes this issue or not?
> If it still causes, could you check which part (the change of xlog.c or the
> addition of regression test) caused the issue?
v07 reproduces the phenomena, even with make coverage-clean between tests.
The possibility is not high though.

We cannot do the regression test separately from xlog.c
because it uses the new error message of xlog.c.
Applying only the TAP test should fail because
we get an warning not error.

Therefore, I took the changes of xlog.c only and
I'm doing the RT in a loop now. If we can get the stderr again,
then we can guess xlog.c is the cause, right ?

I think I can report the result tomorrow.
Just in case, I'm running the RT for OSS HEAD in parallel...
although I cannot reproduce it with it at all.

Best Regards,
Takamichi Osumi

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message osumi.takamichi@fujitsu.com 2021-04-05 14:54:13 RE: Stronger safeguard for archive recovery not to miss data
Previous Message Jaime Casanova 2021-04-05 14:47:41 Re: use AV worker items infrastructure for GIN pending list's cleanup