Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: "Anton A(dot) Melnikov" <aamelnikov(at)inbox(dot)ru>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
Date: 2023-02-17 03:21:14
Message-ID: CA+hUKGLkOYgHLmErZ1jAgR3WKzJRxCjJumFM4mkkba7+hv-NGA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 14, 2023 at 4:38 PM Anton A. Melnikov <aamelnikov(at)inbox(dot)ru> wrote:
> First of all it seemed to me that is not a problem at all since msdn
> guarantees sector-by-sector atomicity.
> "Physical Sector: The unit for which read and write operations to the device
> are completed in a single operation. This is the unit of atomic write..."
> https://learn.microsoft.com/en-us/windows/win32/fileio/file-buffering
> https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile
> (Of course, only if the 512 bytes lays from the beginning of the file with a zero
> offset, but this is our case. The current size of ControlFileData is
> 296 bytes at offset = 0.)

There are two kinds of atomicity that we rely on for the control file today:

* atomicity on power loss (= device property, in case of overwrite filesystems)
* atomicity of concurrent reads and writes (= VFS or kernel buffer
pool interlocking policy)

I assume that documentation is talking about the first thing (BTW, I
suspect that documentation is also now wrong in one special case: NTFS
filesystems mounted on DAX devices don't have sectors or sector-based
atomicity unless you turn on BTT and slow them down[1]; that might
eventually be something for PostgreSQL to think about, and it applies
to other OSes too).

With this patch we would stop relying on the second thing. Supposedly
POSIX requires read/write atomicity, and many file systems offer it in
a strong form (internal range locking) or maybe a weak accidental form
(page-level locking). Since some extremely popular implementations
just don't care, and Windows isn't even a POSIX, we just have to do it
ourselves.

BTW there are at least two other places where PostgreSQL already knows
that concurrent reads and writes are possibly non-atomic (and we also
don't even try to get the alignment right, making the question moot):
pg_basebackup, which enables full_page_writes implicitly if you didn't
have the GUC on already, and pg_rewind, which refuses to run if you
haven't enabled full_page_writes explicitly (as recently discussed on
another thread recently; that's an annoying difference, and also an
annoying behaviour if you know your storage doesn't really need it!)

> I tried to verify this fact experimentally and was very surprised.
> Unfortunately it crashed in an hour during torture test:
> 2023-02-13 15:10:52.675 MSK [5704] LOG: starting PostgreSQL 16devel, compiled by Visual C++ build 1929, 64-bit
> 2023-02-13 15:10:52.768 MSK [5704] LOG: database system is ready to accept connections
> @@@@@@ sizeof(ControlFileData) = 296
> .........
> 2023-02-13 16:10:41.997 MSK [9380] ERROR: calculated CRC checksum does not match value stored in file

Thanks. I'd seen reports of this in discussions on the 'net, but
those had no authoritative references or supporting test results. The
fact that fsync made it take longer (in your following email) makes
sense and matches Linux. It inserts a massive pause in the middle of
the experiment loop, affecting the probabilities.

Therefore, we need a solution for Windows too. I tried to write the
equivalent code, in the attached. I learned a few things: Windows
locks are mandatory (that is, if you lock a file, reads/writes can
fail, unlike Unix). Combined with async release, that means we
probably also need to lock the file in xlog.c, when reading it in
xlog.c:ReadControlFile() (see comments). Before I added that, on a CI
run, I found that the read in there would sometimes fail, and adding
the locking fixed that. I am a bit confused, though, because I
expected that to be necessary only if someone managed to crash while
holding the write lock, which the CI tests shouldn't do. Do you have
any ideas?

While contemplating what else a mandatory file lock might break, I
remembered that basebackup.c also reads the control file. Hrmph. Not
addressed yet; I guess it might need to acquire/release around
sendFile(sink, XLOG_CONTROL_FILE, ...)?

> > I would
> > only want to consider this if we also stop choosing "open_datasync" as
> > a default on at least Windows.
>
> I didn't quite understand this point. Could you clarify it for me, please? If the performance
> of UpdateMinRecoveryPoint() wasn't a problem we could just use fsync in all platforms.

The level of durability would be weakened on Windows. On all systems
except Linux and FreeBSD, we default to wal_sync_method=open_datasync,
so then we would start using FILE_FLAG_WRITE_THROUGH for the control
file too. We know from reading and experimentation that
FILE_FLAG_WRITE_THROUGH doesn't flush the drive cache on consumer
Windows hardware, but its fdatasync-like thing does[2]. I have not
thought too hard about the consequences of using different durability
levels for different categories of file, but messing with write
ordering can't be good for crash recovery, so I'd rather increase WAL
durability than decrease control file durability. If a Windows user
complains that it makes their fancy non-volatile cache slow down, they
can always adjust the settings in PostgreSQL, their OS, or their
drivers etc. I think we should just make fdatasync the default on all
systems.

Here's a patch like that.

[1] https://learn.microsoft.com/en-us/windows-server/storage/storage-spaces/persistent-memory-direct-access
[2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2Ba-7r4GpADsasCnuDBiqC1c31DAQQco2FayVtB9V3sQw%40mail.gmail.com#460bfa5a6b571cc98c575d23322e0b25

Attachment Content-Type Size
0001-Lock-pg_control-while-reading-or-writing.patch text/x-patch 6.6 KB
0002-Make-wal_sync_method-fdatasync-the-default-on-all-OS.patch text/x-patch 5.9 KB
0003-Apply-wal_sync_method-to-pg_control-file.patch text/x-patch 5.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2023-02-17 03:24:02 Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy
Previous Message Amit Kapila 2023-02-17 03:00:09 Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16