I'm sorry that the content of the previous email was incomplete. Let me continue with the issue I've recently encountered.
Previously, the following logic was added to the PrepareRedoAdd() function when fixing this 2PC recovery bug:
+ if (!XLogRecPtrIsInvalid(start_lsn))
+ {
+ char path[MAXPGPATH];
+
+ TwoPhaseFilePath(path, hdr->xid);
+
+ if (access(path, F_OK) == 0)
+ {
+ ereport(reachedConsistency ? ERROR : WARNING,
+ (errmsg("could not recover two-phase state file for transaction %u",
+ hdr->xid),
+ errdetail("Two-phase state file has been found in WAL record %X/%X, but thi
s transaction has already been restored from disk.",
+ LSN_FORMAT_ARGS(start_lsn))));
+ return;
+ }
+ }
Every time we add a 2PC from a WAL record, we check whether the corresponding 2PC file exists on the storage. When a consistent state is reached, it is highly likely that there will be no corresponding files on the storage, so it is probable that an empty file will be accessed. Each time a file is accessed, the OS creates a dentry for it and increases the reference count of the parent directory's dentry. The type of the dentry reference count is int32. When there are many 2PC transactions, this logic may lead to an overflow of the parent dentry's reference count. When reclaiming a dentry, such as during memory reclamation or when deleting the directory, the overflow of the reference count could ultimately cause the OS to crash.
I am considering whether the logic in this part can be modified as follows:
+ if (!XLogRecPtrIsInvalid(start_lsn) && !reachedConsistency)
+ {
+ char path[MAXPGPATH];
+
+ TwoPhaseFilePath(path, hdr->xid);
+
+ if (access(path, F_OK) == 0)
+ {
+ ereport(WARNING,
+ (errmsg("could not recover two-phase state file for transaction %u",
+ hdr->xid),
+ errdetail("Two-phase state file has been found in WAL record %X/%X, but thi
s transaction has already been restored from disk.",
+ LSN_FORMAT_ARGS(start_lsn))));
+ return;
+ }
+ }
Currently, since restoreTwoPhaseData() is the only function that restores 2PC transactions from disk before the recovery starts, after reaching a consistent state, the 2PC transactions are only added from the WAL. Under normal circumstances, there should not be any corresponding 2PC files on the storage at that point. Therefore, I prefer to perform the file access checks only when the server has not yet reached a consistent state. Once consistency has been reached, if a duplicate 2PC transaction is added, it will directly result in an error in the subsequent replay logic.
Do you think this modification is feasible? I look forward to your feedback
Best Regards
suyu.cmj
------------------------------------------------------------------
From:CAI, Mengjuan <mengjuan(dot)cmj(at)alibaba-inc(dot)com>
Send Time:2025 Jul. 9 (Wed.) 09:57
To:Michael Paquier<michael(at)paquier(dot)xyz>
CC:"pgsql-bugs"<pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject:Re: The same 2PC data maybe recovered twice
Hi, all. I would like to sync up on an issue I've recently encountered. Previously, the following logic was added to the PrepareRedoAdd() when fixing this 2PC recovery bug:
------------------------------------------------------------------
From:Michael Paquier <michael(at)paquier(dot)xyz>
Send Time:2023 Jul. 18 (Tue.) 13:14
To:"蔡梦娟(玊于)"<mengjuan(dot)cmj(at)alibaba-inc(dot)com>
CC:"zhihui.fan1213"<zhihui(dot)fan1213(at)gmail(dot)com>; "pgsql-bugs"<pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject:Re: The same 2PC data maybe recovered twice
On Mon, Jul 17, 2023 at 05:20:00PM +0800, suyu.cmj wrote:
> Thanks for the feedback! I have updated the patch, hope you can check it.
I have looked at v3, and noticed that the stat() call is actually a
bit incorrect in its error handling because it would miss any errors
that happen when checking for the existence of the file. The only
error that we should safely expect is ENOENT, for a missing entry.
All the other had better fail like the other code paths restoring 2PC
entries from the shared state. At the end, I have made the choice of
relying only on access() to check the existence of the file as this is
an internal place, simplified a bit the comment. Finally, I have made
the choice to remove the assert-only check. After sleeping on it, it
would have value in very limited cases while a bunch of recovery cases
would take a hit, including developers with large 2PC setups, so there
are a lot of downsides with limited upsides.
--
Michael