Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
Date: 2021-06-23 00:33:47
Message-ID: YNKBazxEayjtyb1x@paquier.xyz
Lists: pgsql-bugs

On Tue, Jun 22, 2021 at 10:11:06AM -0400, Tom Lane wrote:
> Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
>> Your analysis seems right to me. We have to worry about both things:
>> atomicity of writes on power failure (assumed to be sector-level,
>> hence our 512 byte struct -- all good), and atomicity of concurrent
>> reads and writes (we can't assume anything at all, so r/w locking is
>> the simplest way to get a consistent read). Shouldn't relmap_redo()
>> also acquire the lock exclusively?

You mean anything calling write_relmap_file(), right?

> Shouldn't we instead file a kernel bug report? I seem to recall that
> POSIX guarantees atomicity of these things up to some operation size.
> Or is that just for pipe I/O?

Even if this is accepted as a kernel bug, it seems to me that we'd
better add the extra lock anyway, to protect instances that may run
into this issue in the future, no? Just to be on the safe side.
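
To make the idea concrete, here is a minimal sketch (not PostgreSQL
code) of a one-sector, checksummed map file where the writer takes an
exclusive lock and readers take a shared one, so a reader cannot see a
half-updated record. Everything in it is an assumption made for
illustration: the names write_map() and read_map(), the use of flock()
in place of an LWLock, and zlib's CRC-32 in place of the checksum the
relmapper actually uses. It assumes a Unix-like system.

```python
import fcntl
import os
import struct
import zlib

MAP_SIZE = 512            # one sector, like the 512-byte struct discussed above
BODY_SIZE = MAP_SIZE - 4  # the last 4 bytes hold the checksum


def write_map(path, payload):
    """Write one checksummed record while holding an exclusive lock."""
    if len(payload) > BODY_SIZE:
        raise ValueError("payload does not fit in one sector")
    body = payload.ljust(BODY_SIZE, b"\0")
    record = body + struct.pack("<I", zlib.crc32(body))
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # blocks until no reader or writer holds it
        os.pwrite(fd, record, 0)
        os.fsync(fd)
    finally:
        os.close(fd)                     # closing the descriptor releases the lock


def read_map(path):
    """Read the record under a shared lock and verify its checksum."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_SH)   # many readers may hold this at once
        record = os.pread(fd, MAP_SIZE, 0)
    finally:
        os.close(fd)
    if len(record) != MAP_SIZE:
        raise ValueError("short read")
    body, (crc,) = record[:BODY_SIZE], struct.unpack("<I", record[BODY_SIZE:])
    if zlib.crc32(body) != crc:
        raise ValueError("map file contains incorrect checksum")
    return body
```

The checksum alone only detects a torn read after the fact; it is the
shared/exclusive locking that prevents one, which is why taking the
lock on the read side costs little and removes the dependency on what
the kernel guarantees.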

> If we can't assume atomicity of relmapper file I/O, I wonder about
> pg_control as well. But on the whole, what I'm smelling is a moderately
> recently introduced kernel bug. We've been doing this this way for
> years and heard no previous reports.

True. PG_CONTROL_MAX_SAFE_SIZE relies on that. Now, the only processes
updating the control file are the startup process and the checkpointer,
so that's less prone to conflicts than the problem reported here, and
the code takes ControlFileLock where necessary.
--
Michael
