Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
Date: 2021-06-23 07:45:58
Message-ID: f03d9166-ad12-2a3c-f605-c1873ee86ae4@iki.fi
Lists: pgsql-bugs

On 23/06/2021 03:50, Thomas Munro wrote:
> On Wed, Jun 23, 2021 at 2:11 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
>>> Your analysis seems right to me. We have to worry about both things:
>>> atomicity of writes on power failure (assumed to be sector-level,
>>> hence our 512 byte struct -- all good), and atomicity of concurrent
>>> reads and writes (we can't assume anything at all, so r/w locking is
>>> the simplest way to get a consistent read). Shouldn't relmap_redo()
>>> also acquire the lock exclusively?
>>
>> Shouldn't we instead file a kernel bug report? I seem to recall that
>> POSIX guarantees atomicity of these things up to some operation size.
>> Or is that just for pipe I/O?
>
> The spec doesn't cover us according to some opinions, at least:
>
> https://utcc.utoronto.ca/~cks/space/blog/unix/WriteNotVeryAtomic
>
> But at the same time, the behaviour seems quite surprising given the
> parameters involved and how at least I thought this stuff worked in
> practice (ie what the rules about the visibility of writes that
> precede reads imply for the unspoken locking rule that must be the
> obvious reasonable implementation, and the reality of the inode-level
> read/write locking plainly visible in the source). It's possible that
> it's not working as designed in some weird edge case. I guess the
> next thing to do is write a minimal repro and find an expert to ask
> about what it's supposed to do.

That would be nice. At this point, though, I'm convinced that POSIX
doesn't give the guarantees we want, or even if it does, there are a lot
of systems out there that don't respect them. Do we rely on that anywhere
other than in load_relmap_file()? I don't think we do. Let's just add the
lock there.
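
To illustrate the idea (this is only a sketch, not the PostgreSQL code):
the writer holds an exclusive lock while replacing the 512-byte record,
and the reader holds a shared lock while reading it, so the reader never
sees a half-updated record and the checksum always matches. The names
write_map() and read_map() are hypothetical, and flock() stands in for
whatever in-server lock the real fix would take.

```python
import fcntl
import os
import struct
import zlib

MAP_SIZE = 512          # one sector, as discussed in the thread
CRC_SIZE = 4


def write_map(path, payload):
    """Replace the record while holding an exclusive lock (hypothetical helper)."""
    if len(payload) > MAP_SIZE - CRC_SIZE:
        raise ValueError("payload too large")
    body = payload.ljust(MAP_SIZE - CRC_SIZE, b"\0")
    record = body + struct.pack("<I", zlib.crc32(body))
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # blocks until no reader or writer holds it
        os.pwrite(fd, record, 0)
        os.fsync(fd)
    finally:
        os.close(fd)                     # closing the descriptor drops the lock


def read_map(path):
    """Read the record under a shared lock, then verify its checksum."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_SH)   # excludes writers, allows other readers
        record = os.pread(fd, MAP_SIZE, 0)
    finally:
        os.close(fd)
    if len(record) != MAP_SIZE:
        raise ValueError("short read")
    body = record[:-CRC_SIZE]
    (crc,) = struct.unpack("<I", record[-CRC_SIZE:])
    if zlib.crc32(body) != crc:
        raise ValueError("incorrect checksum")
    return body
```

Without the shared lock in read_map(), a concurrent write_map() could
leave the reader with a mix of old and new bytes, which is the kind of
checksum failure reported in this bug.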

Now, that leaves the question of pg_control. That's a different
situation. It doesn't rely on read() and write() being atomic across
processes, but on a 512-byte sector write not being torn on power failure.
How strong is that guarantee? It used to be common wisdom with hard
drives, and it was carried over to SSDs although I'm not sure if it was
ever strictly speaking guaranteed. What about the new kid on the block:
Persistent Memory? I found this article:
https://lwn.net/Articles/686150/. So at the hardware level, Persistent
Memory only guarantees atomicity at cache line level (64 bytes). To
provide the traditional 512-byte sector atomicity, there's a feature in
Linux called BTT (Block Translation Table). Perhaps we should add a note
to the docs that you should enable that.

We haven't heard of broken control files from the field, so that doesn't
seem to be a problem in practice, at least not yet. Still, I would sleep
better if the control file had more redundancy. For example, have two
copies of it on disk. At startup, read both copies, and if they're both
valid, ignore the one with the older timestamp. When updating it, write over
the older copy. That way, if you crash in the middle of updating it, the
old copy is still intact.

- Heikki
