Re: How abnormal server shutdown could be detected by tests?

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: How abnormal server shutdown could be detected by tests?
Date: 2023-12-12 15:00:00
Message-ID: 5921355f-4cfb-c91a-24b8-6bbde53c990c@gmail.com
Lists: pgsql-hackers

Hello Shveta,

12.12.2023 11:44, shveta malik wrote:
>
>> The postmaster process exits with exit code 1, but pg_ctl can't get the
>> code and just reports that stop was completed successfully.
>>
> For what it's worth, there is another thread which stated the similar problem:
> https://www.postgresql.org/message-id/flat/2366244.1651681550%40sss.pgh.pa.us
>

Thank you for the reference!
So I've revived the first part of the question Tom Lane raised before...

I've made a quick experiment with leaving postmaster.pid intact in case of
abnormal shutdown:
@@ -1113,6 +1113,7 @@ UnlinkLockFiles(int status, Datum arg)
     {
         char       *curfile = (char *) lfirst(l);

+        if (strcmp(curfile, DIRECTORY_LOCK_FILE) != 0 || status == 0)
         unlink(curfile);
         /* Should we complain if the unlink fails? */
     }

and `make check-world` passed for me with no failure.
(At the same time, the assertion failure forced as above is detected.)
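
To spell out what the added condition does, here is a standalone sketch. The helper name should_unlink_lock_file is made up for illustration; in the experiment the check sits inline in UnlinkLockFiles(), and the socket lock file path mentioned in the comment is only an example.

```c
#include <stdbool.h>
#include <string.h>

/* Name of the data directory lock file (defined in miscadmin.h in the tree). */
#define DIRECTORY_LOCK_FILE "postmaster.pid"

/*
 * Hypothetical helper, not part of PostgreSQL: returns true when the
 * experimental patch would remove the given lock file at exit.
 *
 * - Any lock file other than postmaster.pid (for example a socket lock
 *   file such as /tmp/.s.PGSQL.5432.lock) is always removed.
 * - postmaster.pid is removed only when the exit status is 0.
 */
static bool
should_unlink_lock_file(const char *curfile, int status)
{
    return strcmp(curfile, DIRECTORY_LOCK_FILE) != 0 || status == 0;
}
```

So a leftover postmaster.pid after the postmaster is gone would mean the exit status was non-zero, which is what a test (or pg_ctl) could check for.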

Though there is a minor issue with a couple of tests. Namely,
003_recovery_targets.pl does the following:
# wait for postgres to terminate
foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
{
    last if !-f $node_standby->data_dir . '/postmaster.pid';
    usleep(100_000);
}
$logfile = slurp_file($node_standby->logfile());
ok( $logfile =~
      qr/FATAL: .* recovery ended before configured recovery target was reached/,
    'recovery end before target reached is a fatal error');

With postmaster.pid left after unclean shutdown, the test waits for 180
seconds by default and then completes successfully.

If I rewrite that loop as follows:
# wait for the error message in the standby log
foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default)
{
    $logfile = slurp_file($node_standby->logfile());
    $res = ($logfile =~
        qr/FATAL: .* recovery ended before configured recovery target was reached/);
    if ($res) {
        last;
    }
    usleep(100_000);
}
ok($res,
    'recovery end before target reached is a fatal error');

the test completes as quickly as before.
(standby.log is only 2kb, so rereading it isn't a big deal, IMO)

So maybe it's the way to go?

Another way I can think of is sending some signal to pg_ctl in case
postmaster terminates with status 0. Though I think it would complicate
things a little, as it allows for three different states:
postmaster.pid preserved (in case postmaster is killed with -9),
postmaster.pid removed and the signal received, and
postmaster.pid removed and the signal not received.
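
Just to illustrate the three states, a rough sketch. The enum and function names are invented for this example (nothing like this exists in pg_ctl), and it assumes postmaster.pid is still removed on every exit that runs the exit callbacks.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
typedef enum
{
    PM_EXIT_CLEAN,      /* postmaster.pid removed, signal received */
    PM_EXIT_ABNORMAL,   /* postmaster.pid removed, no signal received */
    PM_EXIT_KILLED      /* postmaster.pid still present, e.g. kill -9 */
} PostmasterExitKind;

/*
 * Classify how the postmaster went away, given what the waiting side
 * observed: whether postmaster.pid is still there and whether the
 * "exited with status 0" signal arrived.
 */
static PostmasterExitKind
classify_postmaster_exit(bool pid_file_present, bool signal_received)
{
    if (pid_file_present)
        return PM_EXIT_KILLED;
    return signal_received ? PM_EXIT_CLEAN : PM_EXIT_ABNORMAL;
}
```

With the postmaster.pid approach above there are only two outcomes to distinguish (file removed or file left behind), which is why it looks simpler to me.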

Best regards,
Alexander
