Re: pg_rewind does not rewind diverging timelines

From: Mats Kindahl <mats(dot)kindahl(at)gmail(dot)com>
To: surya poondla <suryapoondla4(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: pg_rewind does not rewind diverging timelines
Date: 2026-05-24 18:30:17
Message-ID: CAN305gCcj4Mhr3uBQAnQCYsx6F-syp1rGtazoy=h+_EHO0xOXA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, May 22, 2026 at 12:09 AM surya poondla <suryapoondla4(at)gmail(dot)com>
wrote:

> Hi Mats,
>
> Thanks for picking this up -- the scenario is a real one and I think the
> UUID-tagging approach is a clean way to solve it. v2 applies and builds
> without trouble, and the core algorithm reads well to me.
> I have a handful of observations that I'd love your thoughts.
>

Hi Surya,

Thank you for the review. It is a quite rare scenario, but it is real and
the fix is simple.

> Regarding Correctness I have the below thoughts
>
> 1. UUIDv7 timestamp epoch.
> In StartupXLOG():
> TimestampTz now = GetCurrentTimestamp();
> generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000),
> (uint32)(now % 1000) * 1000);
>
> I think there might be a small mismatch here: GetCurrentTimestamp()
> returns microseconds since the Postgres epoch (2000-01-01),
> whereas generate_uuidv7_r describes its first argument as milliseconds
> since the Unix epoch.
> As written that 30-year offset would land in the UUID's timestamp field,
> so the resulting UUID wouldn't be a conformant UUIDv7 and wouldn't
> time-order against UUIDv7s generated through the SQL functions.
>

>
> Uniqueness is preserved either way, so the rewind logic still works as
> intended but it seemed worth flagging.
>

> I see conversion that's used elsewhere as:
> us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE)
> * SECS_PER_DAY * USECS_PER_SEC;
>
> Or, since promotion isn't on a hot path, gettimeofday() / time(NULL)
> directly would also be fine.
>

Yes, the intention was to use a proper timestamp to allow debugging servers
if necessary. Switched to gettimeofday() and used 0 for sub-ms since this
is not going to be critical. (We could use ns here as well, but that would
only solve a race if you have two servers being promoted in the same ms,
which I find unlikely, and there is a random number added for that
situation.)

> 2. EOR-record path, the intent is unclear.
>
> The comment above generate_uuidv7_r() at says:
>
> "The same UUID is written into the history file and later into the
> XLOG_END_OF_RECOVERY record so that pg_rewind can distinguish two
> servers..."
>
> But from what I can see only the history-file part actually lands.
> xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't add
> the UUID, and XLogCtl->ThisTimeLineUUID is written under info_lck without a
> reader (I couldn't grep it).
>
> The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads like
> preparation for an EOR-struct extension that ended up not being part of the
> patch.
>
> Was the EOR-record piece something you intended to keep for a follow-up,
> or has it been superseded by the history-file approach?
>

No, the EOR changes are not needed for the promotion, contrary to what I
originally thought. Cleaned up the comment and the code and removed all
traces of changes to the EOR (I hope).

>
>
> 3. Malformed UUID handling in readTimeLineHistory().
>
> The optional field-4 path is:
>
> if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN)
> {
> Datum datum = DirectFunctionCall1(uuid_in,
> CStringGetDatum(uuid_str));
> ...
> }
>
> uuid_in() raises ereport(ERROR) on a malformed input, while the
> surrounding syntax-error paths in readTimeLineHistory() use FATAL
> deliberately.
> In practice an ERROR during startup ends up being fatal too, so this isn't
> strictly a bug but it would be nicer to stay consistent.
>

Agree. I added code to capture the error and raise a FATAL instead (with
the error message from the uuid_in, in case it is modified it makes sense
to show this).

> Regarding the Tests I have the following thoughts
>
> The two new cases are nice, a few extensions that I think would strengthen
> them:
> 1. A mixed-version case where one side has a zero UUID. That's the path
> we're claiming is graceful, but nothing currently exercises it
>

Yes, that should work regardless of whether the source or the target has
the zero UUID.

I realized one thing: if two timelines have identical TLI but one has zero
UUID and one has not, it seems they could not come from the same promotion
(one promotion happened on an old server and the other one on a new
server), that is, they should be treated as different. Does that make
sense? I made the necessary changes in the attached patches for testing.
Please have a look.

> 2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that
> findCommonAncestorTimeline's loop walks past matching entries
> before hitting the mismatch. The 0002 test puts the divergence at
> depth 1.
>

I was unsure if this test was necessary or interesting, hence a separate
commit. Since you thought it was useful, it's now rolled into the patch and
I extended the tests with the scenarios you suggested.

I also did some refactorings of the tests to avoid duplication. More below.

> 3. A small assertion against the on-disk 00000002.history contents, to pin
> down the file format.
> 4. On 0002 the dependency on restore_command pointing at node_x's pg_wal
> is the kind of thing that tends to break under
> environment changes. A CHECKPOINT on node_x before the backup, or
> wal_keep_size as in 0001, would let the test stand on its own.
>

Good point.

I refactored the code to avoid some duplication and make the test flow
self-explanatory and as part of that I set the wal_keep_size for all nodes.

In the process I noticed that many of the functions in RewindTest.pm do the
same job as the primitives I wrote, but have hard-coded variable names. I
could rewrite them to take parameters, but that would be quite a big patch
to add additional changes to each call site, so I did not do that and
rather added small wrappers specific for the tests in 005_same_timeline.pl.

Attached a new version of the now single patch.

> I'm happy to keep reviewing/contributing, thanks again for working on it.
>

Thank you for reviewing it.
--
Best wishes,
Mats Kindahl, Multigres Developer, Supabase

Attachment Content-Type Size
v3.0001-pg_rewind-use-UUIDs-to-detect-independent-same-TLI-p.patch text/x-patch 37.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Enrique Sánchez 2026-05-24 18:34:52 Re: Extended statistics improvement: multi-column MCV missing values
Previous Message Philip Alger 2026-05-24 17:31:55 Re: Rename Postgres 19 to Postgres 26 (year-based)?