From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>, pgsql-bugs <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: The same 2PC data maybe recovered twice
On Wed, Jul 12, 2023 at 03:20:57PM +0800, suyu.cmj wrote:
> Yes, this bug can also be reproduced on the master branch, and the
> corresponding reproduction patch is attached.
That's an interesting reproducer with injection points. It looks like
you've spent a lot of time investigating that. So, basically, a
checkpoint fails after writing a 2PC file to disk, but before the redo
LSN has been updated.
> I also considered comparing the 2pc.prepare_start_lsn and
> checkpoint.redo in ProcessTwoPhaseBuffer before, but this method
> requires modifying the format of the 2pc checkpoint file, which will
> bring compatibility issues. Especially for released branches,
> assuming that a node has encountered this bug, it will not be able
> to start successfully due to FATAL during crash recovery, and
> therefore cannot manually commit previous two-phase
> transactions. Using magic number to distinguish 2pc checkpoint file
> versions can't solve the problem in the above scenario either.
> For unreleased branches, writing 2pc.prepare_start_lsn into the
> checkpoint file will be a good solution, but for released branches,
Yes, changing anything in this format is a no-go. Now, things could
be written so that the recovery code is able to handle multiple
formats, meaning that it would be able to feed from a new format that
includes an LSN or something else for the comparison, but that would
not save us from the case where 2PC files with the old format are
still around and a 2PC WAL record is replayed.
> I personally think using a WAL record to overwrite checkpoint data
> would be a more reasonable approach. What do you think?
The O(N) loop added in PrepareRedoAdd() to scan the set of 2PC
transactions stored in TwoPhaseState for the purpose of checking for a
duplicate sucks from a performance point of view, particularly for
deployments where many 2PC transactions are allowed. It could delay
recovery a lot. And actually, this is not completely correct, no?
It is OK to bypass the recovery of the same transaction if the server
has not reached a consistent state, but getting a duplicate when
consistency has been reached should lead to a hard failure.
One approach to avoid this O(N) scan would be to use a hash table to
store the 2PC entries, for example, rather than an array. That would
be simple enough, but such refactoring is rather scary from the point
of view of recovery.
And, actually, we could do something much simpler than what's been
proposed on this thread. PrepareRedoAdd() is called when scanning
pg_twophase at the beginning of recovery, or when replaying a
PREPARE record, both aiming at adding an entry in shmem for the 2PC
transaction tracked. Here is a simpler idea: why don't we just check
in PrepareRedoAdd() whether the 2PC file of the transaction being
recovered is in pg_twophase/ when adding an entry from a WAL record?
If a consistent point has *not* been reached by recovery and we find a
file on disk, then do nothing, because we *know* thanks to the
restoreTwoPhaseData() done at the beginning of recovery that there is
an entry for this file. If a consistent point has been reached in
recovery and we find a file on disk while replaying a WAL record for
the same 2PC file, then fail. If there is no file in pg_twophase for
the record replayed, then add it to the TwoPhaseState array.
Adding an O(N) loop that checks for duplicates may be a good idea as a
cross-check when replaying a record, but I'd rather put that under
USE_ASSERT_CHECKING so that there is no impact on production systems,
while we'd still have some sanity checks for test setups.