From: | Noah Misch <noah(at)leadboat(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru> |
Subject: | Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData |
Date: | 2025-06-03 01:48:46 |
Message-ID: | 20250603014846.f9.nmisch@google.com |
Lists: | pgsql-hackers |
On Fri, May 09, 2025 at 02:08:26PM +0900, Michael Paquier wrote:
> On Tue, Feb 18, 2025 at 04:57:47PM -0800, Noah Misch wrote:
> > As I wrote in [1], "By the time we reach consistency, every file in
> > pg_twophase will be applicable (not committed or aborted)." If we find
> > otherwise, the user didn't follow the backup protocol (or there's another
> > bug). Hence, long-term, we should stop these removals and just fail recovery.
> > We can't fix all data loss consequences of not following the backup protocol,
> > so the biggest favor we can do the user is draw their attention to the
> > problem. How do you see it?
>
> Deciding to not trust at all any of the contents of pg_twophase/ until
> consistency is reached is not something we should aim for, IMO. Going
> in this direction would mean to delay restoreTwoPhaseData() until
> consistency is reached, but there are cases where we can read that
> safely, and where we should do so. For example, this flow is
> perfectly OK to do in the wasShutdown case, where
> PrescanPreparedTransactions() would be able to do its initialization
> job before performing WAL recovery to get a clean list of running
> XIDs.
The wasShutdown case reaches consistency from the beginning, so I don't see
that as an example of a time we benefit from reading pg_twophase before
reaching consistency. Can you elaborate on that?
What's the benefit you're trying to get by reading pg_twophase before reaching
consistency?

Before reaching consistency, our normal approach is to let WAL tell us what to
read, not explore the data directory for files of interest. That's a good
principle, because there are few bounds on the chaos that may exist in the
files of the data directory before reaching consistency. Today's twophase
departs from that principle. In light of this thread's problems, we should
have a strong reason for keeping that departure. The default should be to
align with the rest of recovery in this respect.

I can think of one benefit of attempting to read pg_twophase before reaching
consistency. Suppose we can prove that a pg_twophase file will cause an error
by end of recovery, regardless of what WAL contains. It's nice to fail
recovery immediately instead of failing recovery when we reach consistency.
However, I doubt that benefit is important enough to depart from our usual
principle and incur additional storage seeks in order to achieve that benefit.
If recovery will certainly fail, you are going to have a bad day anyway.
Accelerating recovery failure is a small benefit, particularly when we'd
accelerate failure for only a small slice of recovery failure causes.
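
To make the policy argued for above concrete, here is a minimal sketch in C.
It is not PostgreSQL code: SketchAction and sketch_twophase_file_action() are
hypothetical names used only for illustration. It assumes recovery can tell,
for a given pg_twophase file, whether consistency has been reached and whether
the file's transaction is still prepared (neither committed nor aborted).

```c
#include <stdbool.h>

/* Hypothetical outcomes for one on-disk pg_twophase file. */
typedef enum
{
    SKETCH_IGNORE_FOR_NOW,   /* do not open the file; let WAL replay decide */
    SKETCH_LOAD,             /* file is applicable, restore the prepared xact */
    SKETCH_FAIL_RECOVERY     /* stale file after consistency: report loudly */
} SketchAction;

/*
 * Decide how to treat a pg_twophase file found in the data directory.
 *
 * Before consistency, the directory contents are not trusted, so the file is
 * not examined at all.  Once consistency is reached, every file is expected
 * to belong to a still-prepared transaction; anything else suggests the
 * backup protocol was not followed (or another bug), so recovery fails
 * instead of silently removing the file.
 */
SketchAction
sketch_twophase_file_action(bool reached_consistency, bool xact_still_prepared)
{
    if (!reached_consistency)
        return SKETCH_IGNORE_FOR_NOW;
    if (!xact_still_prepared)
        return SKETCH_FAIL_RECOVERY;
    return SKETCH_LOAD;
}
```

In this sketch the only early-failure opportunity given up is the one
described above: a file that is provably bad is reported when consistency is
reached rather than at the start of recovery.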
> I agree that moving towards a solution where we get rid entirely of
> the CLOG lookups in ProcessTwoPhaseBuffer() is what we should aim for,
> and actually is there a reason to not just nuke and replace them with
> something based on the checkpoint record itself?
I don't know what this means.
> I have to admit that
> I don't quite see the issue with ReadTwoPhaseFile() when it comes to
> crash recovery. For example, in the case of a partial write, doesn't
> the CRC32 check offer some protection about the contents of the file?
Not the protection we want. If we've not reached consistency, we must not
ERROR "calculated CRC checksum does not match value stored in file" for a file
that later WAL may recreate. That might be what you're saying:
> Wouldn't it be OK in this case to assume that the contents of this
> file will be in WAL anyway?
Sure. Meanwhile, if a twophase file is going to be in later WAL, what's the
value in opening the file before we get to that WAL?
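
As a rough illustration of the CRC point, the following C sketch separates
"checksum mismatch" from "raise an error". It is not the logic of
ReadTwoPhaseFile(); CrcSketchResult and sketch_check_twophase_crc() are
hypothetical names, and the stored and computed checksums are assumed to be
supplied by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical results of checking one pg_twophase file's checksum. */
typedef enum
{
    CRC_SKETCH_OK,      /* checksums match */
    CRC_SKETCH_DEFER,   /* mismatch before consistency: later WAL may recreate the file */
    CRC_SKETCH_ERROR    /* mismatch after consistency: a real problem */
} CrcSketchResult;

/*
 * A mismatch is only treated as an error once consistency has been reached.
 * Before that point the file may be a partial write that later WAL will
 * overwrite, so the caller is told to defer rather than fail.
 */
CrcSketchResult
sketch_check_twophase_crc(uint32_t stored_crc, uint32_t computed_crc,
                          bool reached_consistency)
{
    if (stored_crc == computed_crc)
        return CRC_SKETCH_OK;
    return reached_consistency ? CRC_SKETCH_ERROR : CRC_SKETCH_DEFER;
}
```

The DEFER branch is the case in question: if the answer before consistency is
always "wait for WAL", there is little reason to open the file that early.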
> The base backup issue is a different one, of course, and I think that
> we are going to require more data in the 2PC file to provide a better
> cross-check barrier, which would be the addition to the 2PC file of
> the end LSN where the 2PC file record has been inserted. Then we
> could cross-check that with the redo location, and see that it's
> actually safe to discard the file because we know it will be in WAL.
> This seems like a hefty cost to pay for, though, meaning 8 bytes in
> each 2PC file because base backups were done wrong. Bleh.
I'm not saying we should go out of our way to detect base backup protocol
violations. Weakened detection of base backup protocol violations is one
drawback of acting on pg_twophase before consistency, but it's less important
than the deviation from standard recovery principles.