Commit Sequence Numbers and Visibility

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Commit Sequence Numbers and Visibility
Date: 2026-06-02 23:33:41
Message-ID: ea6ecdc74dbce67849526668a461bc4760241439.camel@j-davis.com
Lists: pgsql-hackers

At pgconf.dev, we had an unconference session on $SUBJECT:

https://wiki.postgresql.org/wiki/PGConf.dev_2026_Developer_Unconference#Commit_Sequence_Numbers

CSN visibility semantics are quite simple in one sense, but raise at
least two important questions that I'd like to resolve in this thread:

1. When we acquire a snapshot, exactly what CSN do we choose?
2. Given a specific CSN, what durability requirements must
transactions included in the snapshot meet before the snapshot can
be used?

Those questions are easy to answer on the replica (just use the
last-applied record LSN), so the discussion below is mostly about what
to do on the primary.

The previous discussion at [1] did come up with some answers to those
questions, but I couldn't find much explanation about those topics
specifically. The discussion mostly focused on the mapping of an xid
to its commit LSN. The discussion at [2] didn't need to resolve either
question.

Definitions:

Current Procarray-based snapshot: Changes from a transaction are
visible if it has committed and is not in the procarray.

Proposed CSN-based snapshot: Changes from a transaction are visible if
the transaction's commit record LSN is at or before the CSN.
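
As a minimal sketch of the two definitions (not PostgreSQL code): the
struct, the function names, and the use of 0 as "no commit record" are
assumptions made for illustration, and LSNs are modeled as plain 64-bit
integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* stand-in for XLogRecPtr */
#define INVALID_LSN ((Lsn) 0)   /* assumption: 0 means "no commit record" */

/* Hypothetical per-transaction state, for illustration only. */
typedef struct
{
    bool committed;     /* commit has been recorded */
    bool in_procarray;  /* still listed as running at snapshot time */
    Lsn  commit_lsn;    /* LSN of the commit record, or INVALID_LSN */
} TxnState;

/* Procarray-based rule: committed and no longer in the procarray. */
static bool
visible_procarray(const TxnState *txn)
{
    return txn->committed && !txn->in_procarray;
}

/* CSN-based rule: commit record LSN is at or before the snapshot's CSN. */
static bool
visible_csn(const TxnState *txn, Lsn snapshot_csn)
{
    assert(snapshot_csn != INVALID_LSN);
    return txn->commit_lsn != INVALID_LSN &&
           txn->commit_lsn <= snapshot_csn;
}
```

Note that the CSN rule depends only on two LSNs, which is what makes it
observable from outside the server.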

Motivation:

In either case, basic isolation semantics are preserved.

The problems with the Procarray-based snapshot definition are:

* it is based on the contents of procarray, which is hard to observe
and reason about externally;
* some snapshots on the primary are impossible on the replica; and
* it's inherently more expensive to acquire a new snapshot.

Using CSN-based snapshots solves these problems and also fits nicely
with our generally WAL-centric architecture.

Constraints on CSN choice for new snapshots:

a. CSN must be greater than or equal to the CSN of any previously
acquired snapshot.
b. CSN must be less than or equal to the last-inserted record.
c. All transactions that have released locks must be visible to the
snapshot.
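
The three constraints can be written as a single predicate. This is only
a sketch: the function and parameter names are hypothetical, and it
assumes constraint (c) can be summarized by the highest commit LSN of
any transaction that has released locks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;   /* stand-in for XLogRecPtr */

/*
 * Returns true if "candidate" is an acceptable CSN for a new snapshot.
 * All names here are illustrative, not existing PostgreSQL symbols.
 */
static bool
csn_choice_is_valid(Lsn candidate,
                    Lsn prev_snapshot_csn,       /* highest CSN handed out so far */
                    Lsn last_inserted_lsn,       /* LSN of the last-inserted record */
                    Lsn max_finished_commit_lsn) /* highest commit LSN with locks released */
{
    /* Sanity: nothing tracked can be ahead of the last-inserted record. */
    assert(prev_snapshot_csn <= last_inserted_lsn);
    assert(max_finished_commit_lsn <= last_inserted_lsn);

    return candidate >= prev_snapshot_csn &&       /* (a) */
           candidate <= last_inserted_lsn &&       /* (b) */
           candidate >= max_finished_commit_lsn;   /* (c) */
}
```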

Safety of using a CSN-based snapshot:

The commit record is written before CLOG is updated, so there's a
window where we could assign a CSN and the CLOG hasn't been
updated. We either need to wait until the CLOG update happens before
using the snapshot, or change the way CLOG and/or visibility checks
work to avoid that problem. Existing implementations in [1] do the
latter.

Proposal for Question #1:

I propose that we choose the CSN to be the minimum CSN that satisfies
constraint (c): the highest commit LSN of all the transactions that
have released locks. That offers a useful guarantee: because the last
transaction in the snapshot has already released locks, it has met its
own WAL durability requirement, and all other transactions in the
snapshot have met the same durability requirement (though not
necessarily their *own* durability requirement).

In the discussion at [1], some proposals used the WAL insert pointer
to assign the CSN, but that doesn't seem to have any advantage, and
makes it harder to come up with a satisfactory answer for question #2.

Proposal for Question #2:

Currently, for the procarray-based snapshots, a transaction will only
be included in the snapshot if it meets the durability requirement of
the transaction *writing the changes*. So, a sync rep transaction can
see the effects of a sync=local transaction before it's replicated,
and a sync=local transaction can see the effects of an async
transaction before it's flushed. (Arguably this is a bug.)

Instead, I propose that we wait until the CSN is flushed to the point
that it meets the durability requirements of the transaction *using
the snapshot* rather than the transaction *writing the changes*. To
me, that would be the least-surprising behavior.

If the workload uses a consistent durability requirement, then no
waiting will be required if we assign the CSN as proposed in the
answer to question #1. But it does risk regressions for workloads with
mixed durability requirements.

For instance, let's say that some sync transaction T1 writes a
commit record with LSN 122 but has not flushed it yet. Then there's an
async transaction T2 that writes a commit record with LSN 123, then
finishes and releases locks. Then sync transaction T3 takes a
snapshot, which must include T2 due to constraint (c), and therefore
also includes T1 because T1's commit LSN is less than that of T2. If T3
uses the snapshot right away, that would mean a sync transaction T3 is
reading the changes of another sync transaction T1 that hasn't flushed
yet, which is broken. The only solution is for T3 to wait at least
until LSN 122 is flushed, which could add latency to the sync reader
transactions. (With procarray-based snapshots, T1 would simply not be
included in the snapshot, but that's impossible for CSN-based
snapshots.)
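
The T1/T2/T3 example can be replayed with a small sketch. The Txn struct
and both functions are hypothetical, the flush pointer value of 121 is
an assumption added for the example, and a commit record is treated as
flushed once the flush pointer reaches its LSN (a simplification).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;   /* stand-in for XLogRecPtr */

/* Hypothetical bookkeeping for the example; not a PostgreSQL struct. */
typedef struct
{
    const char *name;
    Lsn         commit_lsn;      /* LSN of the commit record */
    bool        sync;            /* writer requires a flush before finishing */
    bool        released_locks;  /* transaction has finished */
} Txn;

/* Proposed CSN: highest commit LSN among transactions that released locks. */
static Lsn
choose_csn(const Txn *txns, size_t n)
{
    Lsn csn = 0;

    for (size_t i = 0; i < n; i++)
        if (txns[i].released_locks && txns[i].commit_lsn > csn)
            csn = txns[i].commit_lsn;
    return csn;
}

/* Count sync transactions the snapshot includes that are not yet flushed. */
static size_t
unflushed_sync_in_snapshot(const Txn *txns, size_t n, Lsn csn, Lsn flushed_upto)
{
    size_t count = 0;

    assert(csn != 0);
    for (size_t i = 0; i < n; i++)
        if (txns[i].sync &&
            txns[i].commit_lsn <= csn &&
            txns[i].commit_lsn > flushed_upto)
            count++;
    return count;
}
```

With T1 = {122, sync, not finished} and T2 = {123, async, finished},
choose_csn() yields 123. With the flush pointer at 121, T1 is counted as
included but unflushed, and the count drops to zero once the flush
pointer reaches 122.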

We can mitigate that for some cases, such as a mix of async writers
and sync readers, by tracking additional information so that a sync
transaction doesn't wait for a flush if the only unflushed
transactions are async. But there will still be a worst-case workload,
where a steady stream of async writing transactions forces
newly-assigned CSNs to be close to the insert pointer, along with
less-frequent sync writing transactions, and frequent sync reading
transactions that need to wait for the flushes to use the
snapshots. That worst-case workload may be impossible to solve in the
CSN world, so hopefully it's not a major problem.
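
One possible shape for that mitigation, as a sketch only: assume we also
track the commit LSN of the most recent sync writer
(last_sync_commit_lsn, a hypothetical variable). A sync reader then only
needs the flush pointer to reach the smaller of that value and its CSN.
Concurrency and the exact update ordering are ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;   /* stand-in for XLogRecPtr */

/*
 * Every sync commit included in the snapshot has an LSN that is <= csn
 * and <= last_sync_commit_lsn, so the smaller of the two is a safe
 * (possibly conservative) flush target for a sync reader.
 */
static Lsn
sync_reader_flush_target(Lsn csn, Lsn last_sync_commit_lsn)
{
    return last_sync_commit_lsn < csn ? last_sync_commit_lsn : csn;
}

/* True if a sync reader cannot use the snapshot yet. */
static bool
sync_reader_must_wait(Lsn csn, Lsn last_sync_commit_lsn, Lsn flushed_upto)
{
    assert(csn != 0);
    return flushed_upto < sync_reader_flush_target(csn, last_sync_commit_lsn);
}
```

If the last sync commit is old and already flushed, the reader does not
wait even when newer async commits are unflushed. If a sync commit sits
just below the CSN, as in the worst-case workload, the reader still
waits.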

Synchronous replication:

Synchronous replication works as expected here, too. If a transaction
commits locally but its wait for replication is canceled, it
effectively counts as a sync=local transaction. Later sync=on
transactions will need to wait for it to be replicated.

Subtransactions & 2PC:

I don't think subtransactions or 2PC have a major impact on the
questions addressed in this design. If I'm missing something, please
let me know.

SERIALIZABLE:

I haven't fully analyzed the implications for SSI yet, but I believe it's
compatible. Snapshots are still snapshots, and SSI can still detect
conflicts the same way as before. The serializable order can be
different from CSN order, which could be confusing, but that's an
existing problem.

Implementation notes:

We need to track a new shared global variable,
maxTransactionFinishedPtr, used to choose the CSN when acquiring a
snapshot. It represents the highest commit LSN of any transaction that
has released locks:

    maxTransactionFinishedPtr = Max(maxTransactionFinishedPtr,
                                    commitLSN)

We need to update the variable before releasing locks to avoid a race
where a transaction has released locks but is not included in the
snapshot. Also, we need to update the variable after CLOG is updated
and after the durability level is reached, so that it still meets
the assumptions above. This variable can be maintained easily and
cheaply with pg_atomic_monotonic_advance_u64() along with appropriate
memory barriers.
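
A standalone sketch of that update, using C11 atomics as a stand-in for
pg_atomic_monotonic_advance_u64(). The helper names and the snake_case
variable are hypothetical, and the default sequentially consistent
ordering is used instead of hand-placed barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the proposed shared variable maxTransactionFinishedPtr. */
static _Atomic uint64_t max_transaction_finished_ptr = 0;

/*
 * Advance *ptr to target unless it is already at or past target.
 * Returns the value that is in place afterwards (as last observed).
 */
static uint64_t
monotonic_advance_u64(_Atomic uint64_t *ptr, uint64_t target)
{
    uint64_t cur = atomic_load(ptr);

    while (cur < target)
    {
        /* On failure, cur is reloaded with the current value. */
        if (atomic_compare_exchange_weak(ptr, &cur, target))
            return target;
    }
    return cur;
}

/* Called after CLOG update and durability wait, before releasing locks. */
static void
transaction_finished(uint64_t commit_lsn)
{
    assert(commit_lsn != 0);
    monotonic_advance_u64(&max_transaction_finished_ptr, commit_lsn);
    /* ... release locks here ... */
}

/* Snapshot acquisition: the CSN is simply the current value. */
static uint64_t
choose_snapshot_csn(void)
{
    return atomic_load(&max_transaction_finished_ptr);
}
```

Because the variable only moves forward, a CSN chosen this way is never
lower than one handed out earlier, which covers constraint (a).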

Thoughts?

Regards,
Jeff Davis

[1]
https://www.postgresql.org/message-id/flat/CA%2BCSw_tEpJ%3Dmd1zgxPkjH6CWDnTDft4gBi%3D%2BP9SnoC%2BWy3pKdA%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/flat/08da26cc-95ef-4c0e-9573-8b930f80ce27%40iki.fi

[3]
https://www.postgresql.org/message-id/171f45038c4.bf66538a492783.4211925925146471362%40highgo.ca
