Re: Commit Sequence Numbers and Visibility

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Commit Sequence Numbers and Visibility
Date: 2026-06-11 22:24:52
Message-ID: d85a1beeabd971e2dbc8d85f209a480f5f449c2e.camel@j-davis.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 2026-06-11 at 10:28 +0200, Matthias van de Meent wrote:
>
> Yes, "wait for data I'm reading to match my own durability
> requirement" should be configurable.
>
> But I'm not sure if we want to do that at snapshot acquisition time,
> or only when the data is being read that was modified by said
> transaction -- the latter would be a very attractive optimization for
> workloads with different durability expectations which touch
> completely -or, mostly- disjoint datasets in the same database.

Agreed, there are several options here and I think it's an area where
configuration makes sense.

> > >
> > That might be fine, but it's different from CSNs as I understand
> > the
> > meaning (at least in the simplest and most intuitive meaning).
> > Should
> > we call them "Visibility Sequence Numbers" (VSNs) instead?
>
> I think that'd help the difference between durability and visibility,
> yes.

WAL is such a central part of our overall architecture that I think a
simpler visibility foundation (i.e. real CSNs) will pay off in a big
way.

I see "Visibility Sequence Numbers" as a somewhat-unfortunate
compromise. VSNs get the right information in WAL to keep the primary
and replica consistent, which is certainly good; but they leave us with
visibility semantics that don't match the commit order, which is a
source of confusion.

If we have the right set of configuration knobs as suggested above (to
handle mixed async/sync workloads), do you still think we should pursue
VSNs over CSNs?

> Correct, but are we also targeting true snapshot transferability from
> primary to replicas?

That would be a significant benefit of CSNs over VSNs. There would
still be some details to work out about timelines, and maybe other
things I haven't considered, but CSNs get us a lot closer.

> Note that I believe that there is no meaningful difference from a
> consistency standpoint between T1 and T2 both becoming visible at the
> same time (with the same VSN) during recovery, and the system not
> having any snapshot acquired between their VSNs -- which is something
> that could happen on replicas. Side effects of a transaction won't
> appear in WAL before the VSN of that transaction, so there should be
> no opportunity for ordering issues here.

You're correct that VSNs eliminate inconsistencies between the primary
and the replica (or failed-over/recovered system). If there's some
snapshot on the primary with T2 but not T1, there won't be a snapshot
somewhere else with T1 but not T2.

But we will have collapsed T1 and T2 into a single visibility event,
which is technically a loss of information, and we'd need to sort
through the nuances and implications.

> > That might have consequences for PITR. With recovery_target_xid, do
> > you
> > recover up to its commit record or its commit-visibility record?
>
> Good point.
>
> I suspect we'd have to make that a configurable option, but VSN
> (which
> would be end of WAL if no VSN was logged for that commit) would
> probably make the most sense, as it is least likely to cause snapshot
> issues if further wal may be replayed afterward.

Let recovery_target_xid=T1 and let the WAL contents be:

LSN 122 TL1: commit record for T0
LSN 123 TL1: commit record for T1
LSN 124 TL1: commit record for T2
LSN 125 TL1: commit-visible record for T1
LSN 126 TL1: commit-visible record for T0
LSN 127 TL1: whatever
LSN 128 TL1: commit-visible record for T2

If you recover to the commit-visible record (LSN=125), then T0's and
T2's commit records have been replayed, but not their commit-visible
record, so you must write the multi-visible record:

LSN 126 TL2: multi-visible record for {T0, T2}

before startup. That means you have effectively recovered to T2, not
T1. That's wrong, therefore we must define recovery based on commit
records.

If you recover to the commit record (LSN=123) instead, then T0's and
T1's commit records have been replayed, but not their commit-visible
records, so you must write a multi-visible record:

LSN 124 TL2: multi-visible record for {T0, T1}

which loses the information that T1 became visible before T0. That
might be acceptable, but CSNs just seem a lot simpler.

Regards,
Jeff Davis

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2026-06-11 22:33:13 Re: A trivial fix on a comment in pgstat_assoc_relation()
Previous Message Paul A Jungwirth 2026-06-11 21:37:45 Re: Move FOR PORTION OF checks out of analysis