Re: pg_rewind does not rewind diverging timelines

From: Mats Kindahl <mats(dot)kindahl(at)gmail(dot)com>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: pgsql-hackers mailing list <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pg_rewind does not rewind diverging timelines
Date: 2026-06-21 09:09:35
Message-ID: 10351a09-3a93-49fc-a366-d616b67c6ede@gmail.com
Lists: pgsql-hackers

On 6/8/26 12:48, Andrey Borodin wrote:

>> On 30 Apr 2026, at 13:19, Mats Kindahl <mats(dot)kindahl(at)gmail(dot)com> wrote:
>>
>> There is one scenario that I assume is known that TLC found, but does not seem to be fixed. It is a relatively rare case, but since the fix is quite easy, I thought I'd share it with you and get feedback.
> Hi Mats,

Hi Andrey,

Thanks for looking at this.

> Thanks for working on this. I think the problem is real, but I wonder if
> adding a separate UUID to timeline history files is solving it one step
> too late.
>
> If two independent promotions manage to choose the same numeric TLI, then
> we already have two different histories with the same timeline identifier.
> Their history files will also have the same name. A UUID in the file lets
> tools detect the mismatch afterwards, but it does not prevent the archive
> namespace from containing two different meanings for the same TLI.

Yes, that is correct.

> In normal deployments with a shared archive this should only be possible
> when the history file is not visible to the other promoting server:
> either there is no usable restore_command/shared archive, or there is a
> race around publishing and observing the history file. In other words, TLI
> allocation is not atomic, but it is intended to be coordinated through the
> archive.

Yes, that is the ideal way it should work when you have a shared
archive. This works because you have a central authority that
synchronizes the timelines (in theory, not counting bugs).
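
To make the race concrete, here is a minimal sketch of sequential TLI
allocation coordinated through an archive. It is a simplified model, not
the server code: the archive is just a set of TLIs whose history files
are visible, and next_tli() is a hypothetical helper name. Only the
history file naming (eight hex digits plus ".history") is taken from the
existing convention.

```python
def history_file_name(tli: int) -> str:
    # History files are named after the TLI as eight hex digits.
    return f"{tli:08X}.history"


def next_tli(visible_history: set[int], current_tli: int) -> int:
    """Pick the TLI following the newest one visible in the archive.

    Simplified model (assumption): probe successive TLIs until one has
    no visible history file, then take the value after the newest hit.
    """
    newest = current_tli
    probe = current_tli + 1
    while probe in visible_history:
        newest = probe
        probe += 1
    return newest + 1


# Coordinated case: the second promotion sees the first history file.
archive: set[int] = set()
first = next_tli(archive, 1)
archive.add(first)
second = next_tli(archive, 1)
coordinated = (first, second)

# Race (or no shared archive): both promotions look before either
# history file has been published, so both pick the same TLI.
archive = set()
raced = (next_tli(archive, 1), next_tli(archive, 1))
```

In the coordinated case the two promotions end up on different TLIs; in
the raced case both choose the same TLI and would write a history file
with the same name.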

> Maybe we should keep TimelineID as the actual branch identifier and make
> that allocation harder to collide instead of adding a second identifier.
> For example, when choosing a new TLI, add some randomness rather than just
> using the next sequential value.
> That would make the race window much less
> dangerous: two independent promotions would be extremely unlikely to
> choose the same TLI, the history file names would remain distinct, and TLI
> would keep its current role as the timeline identifier.
> This also keeps the operational model simpler. TimelineID is already the
> identifier exposed in WAL file names, history file names, logs, and
> recovery configuration. If we add UUIDs, we effectively introduce another
> identity for the same object, and tools then need to reason about both.
> If instead we make TLI allocation less deterministic under races, the
> existing model remains intact.
>
> Does that framing make sense, or am I missing a case where duplicate TLIs
> are unavoidable even with a shared archive and a less collision-prone
> allocation scheme?

I considered using a random increment of the TLI in the manner you
describe, but there are some issues that make this solution more
complicated from an operational perspective:

* If you skip some TLIs (that is, pick a TLI that is "random but
  larger"), then the relation between them is no longer clear.
    o The history files contain the complete linkage of the timelines,
      so that is covered, but the naming would be strange.
        + For example, if you have history files 1, 5, 7, and 8, then
          these can all belong to different timelines (except 1), or
          be a single timeline, and it is hard to tell which without
          looking through the files.
    o With more promotions, the relation becomes even stranger,
      and the risk of collisions increases. (For example, imagine one
      timeline with 1, 5, 7, 8, 11, and one timeline that forks off 1.
      Then any increment of 4, 6, 7, or 10 will result in a collision.)
* To actually reduce the risk significantly, the range of the added
  randomness needs to be very wide. A smaller range is easier to work
  with, but then you still have to handle timelines that collide.
* Normally, the history file with the highest number will be the only
  relevant one. With this approach, you have to check the contents of
  the files to understand which ones are relevant, which increases the
  operational burden.
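
The collision example and the point about the width of the random range
can be checked with a small sketch. The function names are made up for
illustration, and the probability model is an assumption: each promotion
picks its increment independently and uniformly from 1..width.

```python
def colliding_increments(existing_tlis: list[int], fork_point: int) -> list[int]:
    """Increments from fork_point that land on an already used TLI."""
    return sorted(t - fork_point for t in existing_tlis if t > fork_point)


def collision_probability(promotions: int, width: int) -> float:
    """Chance that at least two of `promotions` independent picks from
    `width` equally likely increments coincide (birthday problem)."""
    no_collision = 1.0
    for i in range(promotions):
        no_collision *= max(0.0, 1.0 - i / width)
    return 1.0 - no_collision


# Timeline 1, 5, 7, 8, 11 and a second timeline forking off 1.
bad = colliding_increments([1, 5, 7, 8, 11], fork_point=1)

# Two racing promotions with a small and with a very wide range.
small_range = collision_probability(2, 16)
wide_range = collision_probability(2, 2**31)
```

With a range of 16, two racing promotions still collide one time in 16;
the range has to be very wide before the probability becomes negligible,
and a wide range is what makes the file numbering hard to read.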

In contrast, if you use a UUID in this manner:

* Adding a UUID does not require a central coordinator, is not
  likely to collide (on the level "impossible to collide"), and is very
  straightforward to add. It also comes with a low risk since the
  places in the code that require changes are very few and not likely
  to have unexpected consequences elsewhere. This works both with and
  without a shared archive.
* Normally, a shared archive should only contain a single timeline.
  Anything else is an anomaly and should be corrected.
* I think it is still necessary to handle the case where you do not
  have a shared archive; it would be an odd limitation to say that
  promote only works if you have a shared archive.
* The UUID still serves a purpose in capturing a situation where
  things have gone wrong. Think of the UUID as something like a
  "checksum": a safety measure and an extra precaution to prevent
  things from going wrong.
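
As a rough illustration of the "checksum" role, the sketch below models
two independent promotions that end up with the same TLI. The
HistoryFile class, its branch_id field, and same_branch() are
hypothetical names for this example and do not reflect the actual format
or code in the patch.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HistoryFile:
    """Toy model of a timeline history file carrying a UUID (assumption)."""
    tli: int
    parent_tli: int
    # Generated locally at promotion time; no coordinator involved.
    branch_id: uuid.UUID = field(default_factory=uuid.uuid4)

    @property
    def name(self) -> str:
        return f"{self.tli:08X}.history"


def same_branch(a: HistoryFile, b: HistoryFile) -> bool:
    """True only if both files describe the very same promotion."""
    return a.tli == b.tli and a.branch_id == b.branch_id


# Two servers promote from timeline 1 without seeing each other.
node_a = HistoryFile(tli=2, parent_tli=1)
node_b = HistoryFile(tli=2, parent_tli=1)

names_collide = node_a.name == node_b.name
diverged = not same_branch(node_a, node_b)
```

The file names are identical, so the TLI alone cannot tell the two
histories apart, while comparing the UUIDs shows that the timelines have
diverged.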

In short, I think the operational issues with random increments of the
history file number are worse, not better, and we should deal with the
name collisions correctly for shared archives instead. There is one
remaining issue: it needs to work even in the case where a promotion
generates a new UUID but the correct history file exists (reported in
the other message). I will look into that.

Best wishes,
Mats Kindahl

> Best regards, Andrey Borodin.
