Re: pg_rewind does not rewind diverging timelines

From: Japin Li <japinli(at)hotmail(dot)com>
To: Mats Kindahl <mats(dot)kindahl(at)gmail(dot)com>
Cc: surya poondla <suryapoondla4(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: pg_rewind does not rewind diverging timelines
Date: 2026-05-26 06:56:09
Message-ID: SY7PR01MB109216DF36CF987CF6EF10B29B60B2@SY7PR01MB10921.ausprd01.prod.outlook.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hi, Mats

Thanks for updating the patch.

On Mon, 25 May 2026 at 20:59, Mats Kindahl <mats(dot)kindahl(at)gmail(dot)com> wrote:
> Hi Japin,
>
> On Mon, May 25, 2026 at 7:21 AM Japin Li <japinli(at)hotmail(dot)com> wrote:
>
> Hi, Mats
>
> On Sun, 24 May 2026 at 20:30, Mats Kindahl <mats(dot)kindahl(at)gmail(dot)com> wrote:
> > On Fri, May 22, 2026 at 12:09 AM surya poondla <suryapoondla4(at)gmail(dot)com> wrote:
> >
> > Hi Mats,
> >
> > Thanks for picking this up -- the scenario is a real one and I think the UUID-tagging approach is a clean way to
> > solve it. v2 applies and builds without trouble, and the core algorithm reads well to me.
> > I have a handful of observations that I'd love your thoughts.
> >
> > Hi Surya,
> >
> > Thank you for the review. It is a quite rare scenario, but it is real and the fix is simple.
> >
> > Regarding Correctness I have the below thoughts
> >
> > 1. UUIDv7 timestamp epoch.
> > In StartupXLOG():
> > TimestampTz now = GetCurrentTimestamp();
> > generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000),
> > (uint32)(now % 1000) * 1000);
> >
> > I think there might be a small mismatch here: GetCurrentTimestamp() returns microseconds since the Postgres epoch
> > (2000-01-01),
> > whereas generate_uuidv7_r describes its first argument as milliseconds since the Unix epoch.
> > As written that 30-year offset would land in the UUID's timestamp field, so the resulting UUID wouldn't be a
> > conformant UUIDv7 and wouldn't
> > time-order against UUIDv7s generated through the SQL functions.
> >
> >
> >
> > Uniqueness is preserved either way, so the rewind logic still works as intended but it seemed worth flagging.
> >
> > I see conversion that's used elsewhere as:
> > us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE)
> > * SECS_PER_DAY * USECS_PER_SEC;
> >
> > Or, since promotion isn't on a hot path, gettimeofday() / time(NULL) directly would also be fine.
> >
> > Yes, the intention was to use a proper timestamp to allow debugging servers if necessary. Switched to gettimeofday
> () and
> > used 0 for sub-ms since this is not going to be critical. (We could use ns here as well, but that would only solve
> a race
> > if you have two servers being promoted in the same ms, which I find unlikely, and there is a random number added
> for that
> > situation.)
> >
> > 2. EOR-record path, the intent is unclear.
> >
> > The comment above generate_uuidv7_r() at says:
> >
> > "The same UUID is written into the history file and later into the XLOG_END_OF_RECOVERY record so that pg_rewind
> can
> > distinguish two servers..."
> >
> > But from what I can see only the history-file part actually lands.
> > xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't add the UUID, and XLogCtl->ThisTimeLineUUID
> is
> > written under info_lck without a
> > reader (I couldn't grep it).
> >
> > The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads like preparation for an EOR-struct extension
> that
> > ended up not being part of the patch.
> >
> > Was the EOR-record piece something you intended to keep for a follow-up, or has it been superseded by the
> > history-file approach?
> >
> > No, the EOR changes are not needed for the promotion, contrary to what I originally thought. Cleaned up the comment
> and
> > the code and removed all traces of changes to the EOR (I hope).
> >
> >
> >
> > 3. Malformed UUID handling in readTimeLineHistory().
> >
> > The optional field-4 path is:
> >
> > if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN)
> > {
> > Datum datum = DirectFunctionCall1(uuid_in,
> > CStringGetDatum(uuid_str));
> > ...
> > }
> >
> > uuid_in() raises ereport(ERROR) on a malformed input, while the surrounding syntax-error paths in
> readTimeLineHistory
> > () use FATAL deliberately.
> > In practice an ERROR during startup ends up being fatal too, so this isn't strictly a bug but it would be nicer to
> > stay consistent.
> >
> > Agree. I added code to capture the error and raise a FATAL instead (with the error message from the uuid_in, in
> case it
> > is modified it makes sense to show this).
> >
> > Regarding the Tests I have the following thoughts
> >
> > The two new cases are nice, a few extensions that I think would strengthen them:
> > 1. A mixed-version case where one side has a zero UUID. That's the path we're claiming is graceful, but nothing
> > currently exercises it
> >
> > Yes, that should work regardless of whether the source or the target has the zero UUID.
> >
> > I realized one thing: if two timelines have identical TLI but one has zero UUID and one has not, it seems they
> could not
> > come from the same promotion (one promotion happened on an old server and the other one on a new server), that is,
> they
> > should be treated as different. Does that make sense? I made the necessary changes in the attached patches for
> testing.
> > Please have a look.
> >
> > 2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that findCommonAncestorTimeline's loop walks past
> > matching entries
> > before hitting the mismatch. The 0002 test puts the divergence at depth 1.
> >
> > I was unsure if this test was necessary or interesting, hence a separate commit. Since you thought it was useful,
> it's
> > now rolled into the patch and I extended the tests with the scenarios you suggested.
> >
> > I also did some refactorings of the tests to avoid duplication. More below.
> >
> > 3. A small assertion against the on-disk 00000002.history contents, to pin down the file format.
> > 4. On 0002 the dependency on restore_command pointing at node_x's pg_wal is the kind of thing that tends to break
> > under
> > environment changes. A CHECKPOINT on node_x before the backup, or wal_keep_size as in 0001, would let the
> test
> > stand on its own.
> >
> > Good point.
> >
> > I refactored the code to avoid some duplication and make the test flow self-explanatory and as part of that I set
> the
> > wal_keep_size for all nodes.
> >
> > In the process I noticed that many of the functions in RewindTest.pm do the same job as the primitives I wrote, but
> have
> > hard-coded variable names. I could rewrite them to take parameters, but that would be quite a big patch to add
> additional
> > changes to each call site, so I did not do that and rather added small wrappers specific for the tests in
> > 005_same_timeline.pl⚠️⚠️.
> >
> > Attached a new version of the now single patch.
> >
> > I'm happy to keep reviewing/contributing, thanks again for working on it.
> >
> > Thank you for reviewing it.
>
> Thank you for your work. I have one comment.
>
> + a = &tlh->source[tlh->sourceNentries - 2].tluuid;
> + b = &tlh->target[tlh->targetNentries - 2].tluuid;
> +
> + if (memcmp(a, &zero, UUID_LEN) == 0 && memcmp(b, &zero, UUID_LEN) == 0)
> + return true;
> +
> + return memcmp(a, b, UUID_LEN) == 0;
>
> Since we already have matchingTimelineUUID(), the above code can be simplified
> using it.
>
> Thank you for the review. I switched to using the matchingTimelineUUID() for this part of the code and made some other
> minor improvements as well.

Here are some comments on v4.

1.
+/*
+ * Timeline histories for both clusters, populated by timelines_match().
+ */

I don't see a timelines_match() function. Does this refer to
matchAndFetchTimelines()?

2.
+typedef struct TimelineHistoriesData
+{
+ TimeLineHistoryEntry *source,
+ *target;
+ int sourceNentries,
+ targetNentries;
+} TimelineHistoriesData;

I'd prefer to use TimeLineHistoriesData to stay consistent with
TimeLineHistoryEntry. Anyway I'm not instant on it.

3.
+typedef TimelineHistoriesData * TimelineHistories;

The space between * and TimelineHistories is unnecessary — see
StringInfoData and other typedefs.

4.
+# node_x and node_b both start from the same TLI 1 baseline.
+my ($node_x, $node_b2) =
+ setup_standbys_from_origin($node_origin2, 'node_x', 'node_b2');

There appears to be a typo in the comment. The node_b should be node_b2.

Everything else looks good. Thank you again for updating the patch!

> --
> Best wishes,
> Mats Kindahl, Multigres Developer, Supabase

--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Zizhuan Liu 2026-05-26 07:05:37 Re:[PATCH] Little refactoring of portalcmds.c
Previous Message Chao Li 2026-05-26 06:55:01 Re: Fix bug of UPDATE/DELETE FOR PORTION OF with inheritance tables