| From: | surya poondla <suryapoondla4(at)gmail(dot)com> |
|---|---|
| To: | Mats Kindahl <mats(dot)kindahl(at)gmail(dot)com> |
| Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: pg_rewind does not rewind diverging timelines |
| Date: | 2026-05-21 22:09:05 |
| Message-ID: | CAOVWO5oZZtniLR4Pyd=e_cS-FNxh837Gbz9TDUnSwWqmbap=bw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Mats,
Thanks for picking this up -- the scenario is a real one and I think the
UUID-tagging approach is a clean way to solve it. v2 applies and builds
without trouble, and the core algorithm reads well to me.
I have a handful of observations that I'd love your thoughts.
Regarding Correctness I have the below thoughts
1. UUIDv7 timestamp epoch.
In StartupXLOG():
TimestampTz now = GetCurrentTimestamp();
generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000),
(uint32)(now % 1000) * 1000);
I think there might be a small mismatch here: GetCurrentTimestamp() returns
microseconds since the Postgres epoch (2000-01-01),
whereas generate_uuidv7_r describes its first argument as milliseconds
since the Unix epoch.
As written that 30-year offset would land in the UUID's timestamp field, so
the resulting UUID wouldn't be a conformant UUIDv7 and wouldn't
time-order against UUIDv7s generated through the SQL functions.
Uniqueness is preserved either way, so the rewind logic still works as
intended but it seemed worth flagging.
I see conversion that's used elsewhere as:
us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE)
* SECS_PER_DAY * USECS_PER_SEC;
Or, since promotion isn't on a hot path, gettimeofday() / time(NULL)
directly would also be fine.
2. EOR-record path, the intent is unclear.
The comment above generate_uuidv7_r() at says:
"The same UUID is written into the history file and later into the
XLOG_END_OF_RECOVERY record so that pg_rewind can distinguish two
servers..."
But from what I can see only the history-file part actually lands.
xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't add
the UUID, and XLogCtl->ThisTimeLineUUID is written under info_lck without a
reader (I couldn't grep it).
The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads like
preparation for an EOR-struct extension that ended up not being part of the
patch.
Was the EOR-record piece something you intended to keep for a follow-up, or
has it been superseded by the history-file approach?
3. Malformed UUID handling in readTimeLineHistory().
The optional field-4 path is:
if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN)
{
Datum datum = DirectFunctionCall1(uuid_in,
CStringGetDatum(uuid_str));
...
}
uuid_in() raises ereport(ERROR) on a malformed input, while the surrounding
syntax-error paths in readTimeLineHistory() use FATAL deliberately.
In practice an ERROR during startup ends up being fatal too, so this isn't
strictly a bug but it would be nicer to stay consistent.
Regarding the Tests I have the following thoughts
The two new cases are nice, a few extensions that I think would strengthen
them:
1. A mixed-version case where one side has a zero UUID. That's the path
we're claiming is graceful, but nothing currently exercises it
2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that
findCommonAncestorTimeline's loop walks past matching entries
before hitting the mismatch. The 0002 test puts the divergence at
depth 1.
3. A small assertion against the on-disk 00000002.history contents, to pin
down the file format.
4. On 0002 the dependency on restore_command pointing at node_x's pg_wal is
the kind of thing that tends to break under
environment changes. A CHECKPOINT on node_x before the backup, or
wal_keep_size as in 0001, would let the test stand on its own.
I'm happy to keep reviewing/contributing, thanks again for working on it.
Regards,
Surya Poondla
>
| From | Date | Subject | |
|---|---|---|---|
| Previous Message | Dave Cramer | 2026-05-21 21:55:35 | Re: Patch for bind message regarding the number of parameters and result column format codes |