From: | Noah Misch <noah(at)leadboat(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru> |
Subject: | Re: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData |
Date: | 2025-06-06 22:34:13 |
Message-ID: | 20250606223413.10.nmisch@google.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Jun 05, 2025 at 04:22:48PM +0900, Michael Paquier wrote:
> On Mon, Jun 02, 2025 at 06:48:46PM -0700, Noah Misch wrote:
> > The wasShutdown case reaches consistency from the beginning, so I don't see
> > that as an example of a time we benefit from reading pg_twophase before
> > reaching consistency. Can you elaborate on that?
> >
> > What's the benefit you're trying to get by reading pg_twophase before reaching
> > consistency?
>
> My point is mostly about code complicity and consistency, by using the
> same logic (aka reading the contents of pg_twophase) at the same point
> in the recovery process
That's a nice goal.
> and perform some filtering of the contents we
> know we will not be able to trust. So I mean to do that once at its
> beginning of recovery, where we only compare the names of the 2PC file
> names with the XID boundaries in the checkpoint record:
> - Discard any files with an XID newer than the next XID. We know that
> these would be in WAL anyway, if they replay from a timeline where it
> matters.
v3-0002-Improve-handling-of-2PC-files-during-recovery.patch has this continue
to emit a WARNING. However, this can happen without any exceptional events in
the backup's history. The message should be LOG or DEBUG if done this early.
This is not new in the patch, and it would be okay to not change it as part of
$SUBJECT. I mention it because, if we scanned pg_twophase after reaching
consistency, an ERROR would be appropriate. (A too-new pg_twophase file then
signifies the backup creator having copied files after the backup stop, a
violation of the backup protocol.)
> - Discard any files that are older than the oldest XID. We know that
> these files don't matter as 2PC transactions hold the XID horizon.
In contrast, a WARNING is fine here. A too-old file implies one of these
three, each of which counts as an exceptional event:
- an OS crash revived a file after unlink(), since RemoveTwoPhaseFile() does not fsync
- unlink() failed in RemoveTwoPhaseFile()
- backup protocol violation
I agree start-of-recovery can correctly do your list of two discard actions;
they do not require a consistent state. If the patch brings us to the point
that recovery does only those two things with pg_twophase before reaching
consistency, that sounds fine. Doing them that early doesn't sound optimal to
me, but I've not edited that area as much as you have. If it gets the
user-visible behaviors right and doesn't look fragile, I'll be fine with it.
> - Keep the others for later evaluation.
>
> So there's no actual need to check the contents of the files, still
> that implies trusting the names of the 2PC files in pg_twophase/.
Trusting file names is fine.
> > I can think of one benefit of attempting to read pg_twophase before reaching
> > consistency. Suppose we can prove that a pg_twophase file will cause an error
> > by end of recovery, regardless of what WAL contains.
Thinking about that more, few errors are detectable at that time. Here too,
if your patch achieves some of this while getting the user-visible behaviors
right and not looking fragile, I'll be fine with it.
> >> Wouldn't it be OK in this case to assume that the contents of this
> >> file will be in WAL anyway?
> >
> > Sure. Meanwhile, if a twophase file is going to be in later WAL, what's the
> > value in opening the file before we get to that WAL?
>
> True. We could avoid loading some 2PC files if we know that these
> could be in WAL when replaying.
If the 2PC data is in future WAL, the file on disk may be a partially-written
file that fails the CRC check. That's a key benefit of not reading the file.
> There is still one point that I'm really unclear about. Some 2PC
> transactions are flushed at checkpoint time, so we will have to trust
> the contents of pg_twophase/ at some point. Do you mean to delay that
> to happen always when consistency is reached
Yep.
> or if we know that we're
> starting from a clean state?
I don't know what "clean slate" means. If "clean state"=wasShutdown, that's
just a special case of "consistency is reached".
> What I'd really prefer to avoid is
> having two code paths in charge of reading the contents of
> pg_twophase.
That's a good goal. I suspect you'll achieve that.
> The point I am trying to make is that there has to be a certain level
> of trust in the contents of pg_twophase, at some point during replay.
> Or, we invent a new mechanism where all the twophase files go through
> WAL and remove the need for pg_twophase/ when recovering. For
> example, we could have an extra record generated at each checkpoint
> with the contents of the 2PC files still pending for commit
> (potentially costly if the same transaction persists across multiple
> checkpoints as this gets repeated), or something entirely different.
> Something like that would put WAL as the sole source of trust by
> design.
Adding such a record would be nonstandard and unnecessary. Before reaching
consistency, the standard is that recovery reads and writes what WAL directs
it to read and write. Upon reaching consistency, the server can trust the
entire data directory, including parts that WAL never referenced.
> Or are you seeing things differently? In which case, I am not sure
> what's the line you think would be "correct" here (well, you did say
> only WAL until consistency is reached), nor do I completely understand
> how much 2PC transaction state we should have in shared memory until
> consistency is reached.
Shared memory should (not "must") omit a GXACT whose prepare_start_lsn lies
beyond the point recovery has reached. In other words, recovery should keep
anachronistic information out of shared memory. For example, one could
imagine a straw man implementation in which, before reaching consistency, we
load each pg_twophase file that does pass its CRC check. I'd dislike that,
because it would facilitate anachronisms involving the
GlobalTransactionData.gid field. We could end up with two GXACT having the
same gid, one being the present (according to recovery progress) instance of
that gid and another being a future instance. The system might not
malfunction today, but I'd consider that fragile. Anachronistic entries might
cause recovery to need more shared memory than max_prepared_transactions has
allocated.
You might find some more-important goal preempts that no-anachronisms goal.
From | Date | Subject | |
---|---|---|---|
Next Message | Jelte Fennema-Nio | 2025-06-06 22:41:23 | Re: md5 password deprecation might cause problems with PgBouncer setups |
Previous Message | Jeff Davis | 2025-06-06 22:23:41 | Re: Remaining dependency on setlocale() |