Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: "Anton A(dot) Melnikov" <aamelnikov(at)inbox(dot)ru>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
Date: 2023-02-24 10:12:47
Message-ID: 0194f905-1356-d300-6079-f2b86fa94934@inbox.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, Thomas!

On 17.02.2023 06:21, Thomas Munro wrote:
> There are two kinds of atomicity that we rely on for the control file today:
>
> * atomicity on power loss (= device property, in case of overwrite filesystems)
> * atomicity of concurrent reads and writes (= VFS or kernel buffer
> pool interlocking policy)
>
> I assume that documentation is talking about the first thing

I think this is true, as documentation says about write operations only,
but not about the read ones.

> (BTW, I
> suspect that documentation is also now wrong in one special case: NTFS
> filesystems mounted on DAX devices don't have sectors or sector-based
> atomicity unless you turn on BTT and slow them down[1]; that might
> eventually be something for PostgreSQL to think about, and it applies
> to other OSes too).

Very interesting find! For instance, the volume of Intel® Optane™ Persistent Memory
already reaches 512GB and can be potentially used for cluster data. As the
first step it would be good to understand what Microsoft means by
large memory pages, what size are they talking about, that is, where
is the reasonable boundary for using BTT; i suppose, this will help choose
whether to use BTT or have to write own DAX-aware code.

> With this patch we would stop relying on the second thing. Supposedly
> POSIX requires read/write atomicity, and many file systems offer it in
> a strong form (internal range locking) or maybe a weak accidental form
> (page-level locking). Since some extremely popular implementations
> just don't care, and Windows isn't even a POSIX, we just have to do it
> ourselves.

Yes. indeed. But unless it's an atomic or transactional filesystem. In
such a case there is almost nothing to do. Another thing is that
it seems such a systems do not exist in reality although it has been
discussed many times I've googled some information on this topic e.g
[1], [2], [3] but all of project were abandoned or deprecated as
Microsoft's own development.

> BTW there are at least two other places where PostgreSQL already knows
> that concurrent reads and writes are possibly non-atomic (and we also
> don't even try to get the alignment right, making the question moot):
> pg_basebackup, which enables full_page_writes implicitly if you didn't
> have the GUC on already, and pg_rewind, which refuses to run if you
> haven't enabled full_page_writes explicitly (as recently discussed on
> another thread recently; that's an annoying difference, and also an
> annoying behaviour if you know your storage doesn't really need it!)

It seems a good topic for a separate thread patch. Would you provide a
link to the thread you mentioned please?

> Therefore, we need a solution for Windows too. I tried to write the
> equivalent code, in the attached. I learned a few things: Windows
> locks are mandatory (that is, if you lock a file, reads/writes can
> fail, unlike Unix). Combined with async release, that means we
> probably also need to lock the file in xlog.c, when reading it in
> xlog.c:ReadControlFile() (see comments). Before I added that, on a CI
> run, I found that the read in there would sometimes fail, and adding
> the locking fixed that. I am a bit confused, though, because I
> expected that to be necessary only if someone managed to crash while
> holding the write lock, which the CI tests shouldn't do. Do you have
> any ideas?

Unfortunately, no ideas so far. Was it a pg_control CRC or I/O errors?
Maybe logs of such a fail were saved somewhere? I would like to see
them if possible.

> While contemplating what else a mandatory file lock might break, I
> remembered that basebackup.c also reads the control file. Hrmph. Not
> addressed yet; I guess it might need to acquire/release around
> sendFile(sink, XLOG_CONTROL_FILE, ...)?

Here, possibly pass a boolean flag into sendFile()? When it is true,
then take a lock after OpenTransientFile() and release it before
CloseTransientFile() if under Windows.

There are also two places where read or write from/to the pg_control
occur. These are functions WriteControlFile() in xlog.c and
read_controlfile() in pg_resetwal.c.
For the second case, locking definitely not necessary as
the server is stopped. For the first case seems too as BootStrapXLOG()
where WriteControlFile() will be called inside must be called
only once on system install.

Since i've smoothly moved on to the code review here there is a
suggestion at your discretion to add error messages to get_controlfile()
and update_controlfile() if unlock_controlfile() fails.

> I think we should just make fdatasync the default on all
> systems.

Agreed. And maybe choose the UPDATE_CONTROLFILE_FDATASYNC as the default
case in UpdateControlFile() since fdatasync now the default on all
systems and its metadata are of no interest?

On 22.02.2023 03:10, Thomas Munro wrote:
> If we go this way, I suppose, in theory at least, someone with
> external pg_backup_start()-based tools might also want to hold the
> lock while copying pg_control. Otherwise they might fail to open it
> on Windows (where that patch uses a mandatory lock)

As for external pg_backup_start()-based tools if somebody want to take the
lock while copying pg_control i suppose it's a normal case. He may have to wait
a bit until we release the lock, like in lock_controlfile(). Moreover
this is a very good desire, as it guarantees the integrity of pg_control if
only someone is going to use F_SETLKW rather than non-waiting F_SETLK.

Backup of locked files is the common place in Windows and all standard
backup tools can do it well via VSS (volume shadow copy) including
embedded windows backup tool. Just in case, i tested it experimentally.
During the torture test first try to copy pg_control and predictably caught:
Error: Cannot read C:\pgbins\master\data\global\pg_control!
"The process cannot access the file because another process has locked a portion of the file."
But copy using own windows backup utility successfully copied it with VSS.

> > One cute hack I thought of to make the file lock effectively advisory
> on Windows is to lock a byte range *past the end* of the file (the
> documentation says you can do that). That shouldn't stop programs
> that want to read the file without locking and don't know/care about
> our scheme (that is, pre-existing backup tools that haven't considered
> this problem and remain oblivious or accept the (low) risk of torn
> reads), but will block other participants in our scheme.

A very interesting idea. It makes sense when someone using external
backup tools that can not work with VSS. But the fact of using such a tools
under Windows is highly doubtful, i guess. It will not allow to backup
many other applications and windows system itself.
Let me to join you suggestion that it'd be good to hear from backup
gurus what they think about that.

> or copy garbage on
> Linux (as they can today, I assume), with non-zero probability -- at
> least when copying files from a hot standby.
> Or backup tools might
> want to get the file contents through some entirely different
> mechanism that does the right interlocking (whatever that might be,
> maybe inside the server). Perhaps this is not so much the localised
> systems programming curiosity I thought it was, and has implications
> that'd need to be part of the documented low-level backup steps. It
> makes me like the idea a bit less. It'd be good to hear from backup
> gurus what they think about that.
> If we went back to the keep-rereading-until-it-stops-changing model,
> then an external backup tool would need to be prepared to do that too,
> in theory at least. Maybe some already do something like that?
>
> Or maybe the problem is/was too theoretical before...

As far as i understand, this problem has always been, but the probability of
this is extremely small in practice, which is directly pointed in
the docs [4]:
"So while it is theoretically a weak spot, pg_control does not seem
to be a problem in practice."

> Here's a patch like that.

In this patch, the problem is solved for the live database and
maybe remains for some possible cases of an external backup. In a whole,
i think it can be stated that this is a sensible step forward.

Just like last time, i ran a long stress test under windows with current patch.
There were no errors for more than 3 days even with fsync=off.

[1] https://lwn.net/Articles/789600/
[2] https://github.com/ut-osa/txfs
[3] https://en.wikipedia.org/wiki/Transactional_NTFS[4] https://www.postgresql.org/docs/devel/wal-internals.html

With the best wishes!

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2023-02-24 10:25:55 Re: PATCH: Using BRIN indexes for sorted output
Previous Message Daniel Gustafsson 2023-02-24 09:49:13 Re: TAP output format in pg_regress