| From: | Mats Kindahl <mats(dot)kindahl(at)gmail(dot)com> |
|---|---|
| To: | Japin Li <japinli(at)hotmail(dot)com> |
| Cc: | surya poondla <suryapoondla4(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: pg_rewind does not rewind diverging timelines |
| Date: | 2026-05-26 16:03:32 |
| Message-ID: | CAN305gCaErXmG3fg48n50dWUC7=ETBBopuFL_cgyzutXUdp-5g@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Japin,
Thanks for reviewing the patch.
On Tue, May 26, 2026 at 8:56 AM Japin Li <japinli(at)hotmail(dot)com> wrote:
>
> Hi, Mats
>
> Thanks for updating the patch.
>
> On Mon, 25 May 2026 at 20:59, Mats Kindahl <mats(dot)kindahl(at)gmail(dot)com> wrote:
> > Hi Japin,
> >
> > On Mon, May 25, 2026 at 7:21 AM Japin Li <japinli(at)hotmail(dot)com> wrote:
> >
> > Hi, Mats
> >
> > On Sun, 24 May 2026 at 20:30, Mats Kindahl <mats(dot)kindahl(at)gmail(dot)com>
> wrote:
> > > On Fri, May 22, 2026 at 12:09 AM surya poondla <
> suryapoondla4(at)gmail(dot)com> wrote:
> > >
> > > Hi Mats,
> > >
> > > Thanks for picking this up -- the scenario is a real one and I think
> the UUID-tagging approach is a clean way to
> > > solve it. v2 applies and builds without trouble, and the core
> algorithm reads well to me.
> > > I have a handful of observations that I'd love your thoughts.
> > >
> > > Hi Surya,
> > >
> > > Thank you for the review. It is a quite rare scenario, but it is real
> and the fix is simple.
> > >
> > > Regarding Correctness I have the below thoughts
> > >
> > > 1. UUIDv7 timestamp epoch.
> > > In StartupXLOG():
> > > TimestampTz now = GetCurrentTimestamp();
> > > generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000),
> > > (uint32)(now % 1000) * 1000);
> > >
> > > I think there might be a small mismatch here: GetCurrentTimestamp()
> returns microseconds since the Postgres epoch
> > > (2000-01-01),
> > > whereas generate_uuidv7_r describes its first argument as
> milliseconds since the Unix epoch.
> > > As written that 30-year offset would land in the UUID's timestamp
> field, so the resulting UUID wouldn't be a
> > > conformant UUIDv7 and wouldn't
> > > time-order against UUIDv7s generated through the SQL functions.
> > >
> > >
> > >
> > > Uniqueness is preserved either way, so the rewind logic still works
> as intended but it seemed worth flagging.
> > >
> > > I see conversion that's used elsewhere as:
> > > us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE)
> > > * SECS_PER_DAY * USECS_PER_SEC;
> > >
> > > Or, since promotion isn't on a hot path, gettimeofday() / time(NULL)
> directly would also be fine.
> > >
> > > Yes, the intention was to use a proper timestamp to allow debugging
> servers if necessary. Switched to gettimeofday
> > () and
> > > used 0 for sub-ms since this is not going to be critical. (We could
> use ns here as well, but that would only solve
> > a race
> > > if you have two servers being promoted in the same ms, which I find
> unlikely, and there is a random number added
> > for that
> > > situation.)
> > >
> > > 2. EOR-record path, the intent is unclear.
> > >
> > > The comment above generate_uuidv7_r() at says:
> > >
> > > "The same UUID is written into the history file and later into the
> XLOG_END_OF_RECOVERY record so that pg_rewind
> > can
> > > distinguish two servers..."
> > >
> > > But from what I can see only the history-file part actually lands.
> > > xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't
> add the UUID, and XLogCtl->ThisTimeLineUUID
> > is
> > > written under info_lck without a
> > > reader (I couldn't grep it).
> > >
> > > The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads
> like preparation for an EOR-struct extension
> > that
> > > ended up not being part of the patch.
> > >
> > > Was the EOR-record piece something you intended to keep for a
> follow-up, or has it been superseded by the
> > > history-file approach?
> > >
> > > No, the EOR changes are not needed for the promotion, contrary to
> what I originally thought. Cleaned up the comment
> > and
> > > the code and removed all traces of changes to the EOR (I hope).
> > >
> > >
> > >
> > > 3. Malformed UUID handling in readTimeLineHistory().
> > >
> > > The optional field-4 path is:
> > >
> > > if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN)
> > > {
> > > Datum datum = DirectFunctionCall1(uuid_in,
> > >
> CStringGetDatum(uuid_str));
> > > ...
> > > }
> > >
> > > uuid_in() raises ereport(ERROR) on a malformed input, while the
> surrounding syntax-error paths in
> > readTimeLineHistory
> > > () use FATAL deliberately.
> > > In practice an ERROR during startup ends up being fatal too, so this
> isn't strictly a bug but it would be nicer to
> > > stay consistent.
> > >
> > > Agree. I added code to capture the error and raise a FATAL instead
> (with the error message from the uuid_in, in
> > case it
> > > is modified it makes sense to show this).
> > >
> > > Regarding the Tests I have the following thoughts
> > >
> > > The two new cases are nice, a few extensions that I think would
> strengthen them:
> > > 1. A mixed-version case where one side has a zero UUID. That's the
> path we're claiming is graceful, but nothing
> > > currently exercises it
> > >
> > > Yes, that should work regardless of whether the source or the target
> has the zero UUID.
> > >
> > > I realized one thing: if two timelines have identical TLI but one has
> zero UUID and one has not, it seems they
> > could not
> > > come from the same promotion (one promotion happened on an old server
> and the other one on a new server), that is,
> > they
> > > should be treated as different. Does that make sense? I made the
> necessary changes in the attached patches for
> > testing.
> > > Please have a look.
> > >
> > > 2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that
> findCommonAncestorTimeline's loop walks past
> > > matching entries
> > > before hitting the mismatch. The 0002 test puts the divergence
> at depth 1.
> > >
> > > I was unsure if this test was necessary or interesting, hence a
> separate commit. Since you thought it was useful,
> > it's
> > > now rolled into the patch and I extended the tests with the scenarios
> you suggested.
> > >
> > > I also did some refactorings of the tests to avoid duplication. More
> below.
> > >
> > > 3. A small assertion against the on-disk 00000002.history contents,
> to pin down the file format.
> > > 4. On 0002 the dependency on restore_command pointing at node_x's
> pg_wal is the kind of thing that tends to break
> > > under
> > > environment changes. A CHECKPOINT on node_x before the backup,
> or wal_keep_size as in 0001, would let the
> > test
> > > stand on its own.
> > >
> > > Good point.
> > >
> > > I refactored the code to avoid some duplication and make the test
> flow self-explanatory and as part of that I set
> > the
> > > wal_keep_size for all nodes.
> > >
> > > In the process I noticed that many of the functions in RewindTest.pm
> do the same job as the primitives I wrote, but
> > have
> > > hard-coded variable names. I could rewrite them to take parameters,
> but that would be quite a big patch to add
> > additional
> > > changes to each call site, so I did not do that and rather added
> small wrappers specific for the tests in
> > > 005_same_timeline.pl⚠️⚠️.
> > >
> > > Attached a new version of the now single patch.
> > >
> > > I'm happy to keep reviewing/contributing, thanks again for working
> on it.
> > >
> > > Thank you for reviewing it.
> >
> > Thank you for your work. I have one comment.
> >
> > + a = &tlh->source[tlh->sourceNentries - 2].tluuid;
> > + b = &tlh->target[tlh->targetNentries - 2].tluuid;
> > +
> > + if (memcmp(a, &zero, UUID_LEN) == 0 && memcmp(b, &zero,
> UUID_LEN) == 0)
> > + return true;
> > +
> > + return memcmp(a, b, UUID_LEN) == 0;
> >
> > Since we already have matchingTimelineUUID(), the above code can be
> simplified
> > using it.
> >
> > Thank you for the review. I switched to using the matchingTimelineUUID()
> for this part of the code and made some other
> > minor improvements as well.
>
> Here are some comments on v4.
>
> 1.
> +/*
> + * Timeline histories for both clusters, populated by timelines_match().
> + */
>
> I don't see a timelines_match() function. Does this refer to
> matchAndFetchTimelines()?
>
Correct. Updated.
>
> 2.
> +typedef struct TimelineHistoriesData
> +{
> + TimeLineHistoryEntry *source,
> + *target;
> + int sourceNentries,
> + targetNentries;
> +} TimelineHistoriesData;
>
> I'd prefer to use TimeLineHistoriesData to stay consistent with
> TimeLineHistoryEntry. Anyway I'm not instant on it.
>
Makes sense to be consistent. Updated.
>
> 3.
> +typedef TimelineHistoriesData * TimelineHistories;
>
> The space between * and TimelineHistories is unnecessary — see
> StringInfoData and other typedefs.
>
My mistake. FIxed.
> 4.
> +# node_x and node_b both start from the same TLI 1 baseline.
> +my ($node_x, $node_b2) =
> + setup_standbys_from_origin($node_origin2, 'node_x', 'node_b2');
>
> There appears to be a typo in the comment. The node_b should be node_b2.
>
Right. Fixed.
>
>
> Everything else looks good. Thank you again for updating the patch!
>
Thank you again for reviewing the patch. :)
Attached a new version of the patch with the changes you suggested.
--
Best wishes,
Mats Kindahl, Multigres Developer, Supabase
| Attachment | Content-Type | Size |
|---|---|---|
| v5.0001-pg_rewind-use-UUIDs-to-detect-independent-same-TLI-p.patch | text/x-patch | 37.6 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Nathan Bossart | 2026-05-26 16:05:47 | future of PQfn() |
| Previous Message | Jacob Champion | 2026-05-26 15:54:04 | Re: Removing broken support for OpenSSL without ECDH |