Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: "Anton A(dot) Melnikov" <aamelnikov(at)inbox(dot)ru>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
Date: 2023-01-31 01:09:58
Message-ID: 20d45197-80be-4984-33a9-a8d13708992a@inbox.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello!

On 24.11.2022 04:02, Thomas Munro wrote:
> On Thu, Nov 24, 2022 at 11:05 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
>
> ERROR: calculated CRC checksum does not match value stored in file
>
> The attached draft patch fixes it.

Tried to catch this error on my PC, but failed to do it within a reasonable time.
The 1s interval is probably too long for me.
It seems there are more durable way to reproduce this bug with 0001 patch applied:
At the first backend:

do $$ begin loop perform pg_update_control_file(); end loop; end; $$;

At the second one:

do $$ begin loop perform pg_control_system(); end loop; end; $$;

It will fails almost immediately with:
"ERROR: calculated CRC checksum does not match value stored in file"
both with fsync = off and fsync = on.
Checked it out for master and REL_11_STABLE.

Also checked for a few hours that the patch 0002 fixes this error,
but there are some questions to its logical structure.
The equality between the previous and newly calculated crc is checked only
if the last crc calculation was wrong, i.e not equal to the value stored in the file.
It is very unlikely that in this case the previous and new crc can match, so, in fact,
the loop will spin until crc is calculated correctly. In the other words,
this algorithm is practically equivalent to an infinite loop of reading from a file
and calculating crc while(EQ_CRC32C(crc, ControlFile->crc) != true).
But then it can be simplified significantly by removing checksums equality checks,
bool fist_try and by limiting the maximum number of iterations
with some constant in the e.g. for loop to avoid theoretically possible freeze.

Or maybe use the variant suggested by Tom Lane, i.e, as far as i understand,
repeat the file_open-read-close-calculate_crc sequence twice without a pause between
them and check the both calculated crcs for the equality. If they match, exit and return
the bool result of comparing between the last calculation with the value from the file,
if not, take a pause and repeat everything from the beginning.
In this case, no need to check *crc_ok_p inside get_controlfile()
as it was in the present version; i think it's more logically correct
since this variable is intended top-level functions and the logic
inside get_controlfile() should not depend on its value.

Also found a warning in 0001 patch for master branch. On my PC gcc gives:

xlog.c:2507:1: warning: no previous prototype for ‘pg_update_control_file’ [-Wmissing-prototypes]
2507 | pg_update_control_file()

Fixed it with #include "utils/fmgrprotos.h" to xlog.c and
add PG_FUNCTION_ARGS to pg_update_control_file().

With the best wishes,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2023-01-31 01:19:13 Re: Assertion failure in SnapBuildInitialSnapshot()
Previous Message Thomas Munro 2023-01-31 01:00:05 Re: pg_upgrade test failure