| From: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
|---|---|
| To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
| Cc: | pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Commit Sequence Numbers and Visibility |
| Date: | 2026-06-10 12:19:01 |
| Message-ID: | CAEze2WihVmsjrEdMXmqhmsy-R8puBjopuvpqmHYwFP_usF+aSw@mail.gmail.com |
| Lists: | pgsql-hackers |
On Wed, 3 Jun 2026 at 01:33, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> At pgconf.dev, we had an unconference session on $SUBJECT:
>
> https://wiki.postgresql.org/wiki/PGConf.dev_2026_Developer_Unconference#Commit_Sequence_Numbers
>
> CSN visibility semantics are quite simple in one sense, but raise at
> least two important questions that I'd like to resolve in this thread:
>
> 1. When we acquire a snapshot, exactly what CSN do we choose?
> 2. Given a specific CSN, what durability requirements must
> transactions included in the snapshot meet before the snapshot can
> be used?
>
> Those questions are easy to answer on the replica (just use the
> last-applied record LSN), so the discussion below is mostly about what
> to do on the primary.
I disagree with this framing; it assumes that the snapshot determines
the durability requirements for the transactions, and that is not a
decision that the snapshot mechanism necessarily must or should make.
Our current snapshot mechanism allows it to be a decision that the
committing transaction makes (+ edge cases related to cancellation
etc.); this means that snapshot performance isn't meaningfully
impacted by concurrent sessions' durability requirements.
> The previous discussion at [1] did come up with some answers to those
> questions, but I couldn't find much explanation about those topics
> specifically. The discussion mostly focused on the mapping of an xid
> to its commit LSN. The discussion at [2] didn't need to resolve either
> question.
>
>
> Definitions:
>
> Current Procarray-based snapshot: Changes from a transaction are
> visible if it has committed and is not in the procarray.
>
> Proposed CSN-based snapshot: Changes from a transaction are visible if
> the transaction's commit record LSN is at or before the CSN.
I'm not so stoked about this specifically; see below for details.
> Motivation:
>
> In either case, basic isolation semantics are preserved.
>
> The problems with the Procarray-based snapshot definition are:
>
> * it is based on the contents of procarray, which is hard to observe
> and reason about externally;
> * some snapshots on the primary are impossible on the replica; and
> * it's inherently more expensive to acquire a new snapshot.
This is mostly a "global" expense. I suspect that the local cost of
acquiring and using a snapshot is not significantly different
between the CSN and procarray approaches; CSNs may even be
more CPU-expensive (locally) than our PgProc approach if we need to
populate the [xmin, xmax] range of commit-is-visible checks from our
own backend through SLRU lookups.
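To make the two quoted definitions concrete, here is a minimal toy sketch of both visibility rules. Everything in it is an assumption made for illustration: the class names, the `committed` set standing in for CLOG, and the `commit_lsn` dict standing in for whatever xid-to-commit-LSN structure (SLRU or otherwise) a CSN implementation would use. It ignores xid wraparound, subtransactions and hint bits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcarraySnapshot:
    """Toy procarray-style snapshot: the running xids are copied into the snapshot."""
    xmin: int               # every xid below this had finished at snapshot time
    xmax: int               # first xid not yet assigned at snapshot time
    in_progress: frozenset  # xids that were running at snapshot time

    def sees(self, xid, committed):
        if xid >= self.xmax:
            return False  # started after the snapshot was taken
        if xid >= self.xmin and xid in self.in_progress:
            return False  # was still running when the snapshot was taken
        return xid in committed  # stand-in for a CLOG lookup


@dataclass(frozen=True)
class CsnSnapshot:
    """Toy CSN-style snapshot: a single LSN, nothing else is stored locally."""
    csn: int

    def sees(self, xid, commit_lsn):
        # Each check needs the xid's commit LSN from a shared map
        # (hypothetical stand-in for an SLRU-backed lookup).
        lsn = commit_lsn.get(xid)
        return lsn is not None and lsn <= self.csn


# xid 10 committed at LSN 100, xid 12 at LSN 120, xid 11 is still running.
commit_lsn = {10: 100, 12: 120}
committed = set(commit_lsn)

proc_snap = ProcarraySnapshot(xmin=11, xmax=13, in_progress=frozenset({11}))
csn_snap = CsnSnapshot(csn=120)
older_csn_snap = CsnSnapshot(csn=110)

xids = [10, 11, 12]
proc_visible = [x for x in xids if proc_snap.sees(x, committed)]
csn_visible = [x for x in xids if csn_snap.sees(x, commit_lsn)]
older_csn_visible = [x for x in xids if older_csn_snap.sees(x, commit_lsn)]
```

In this toy model the procarray snapshot answers the in-progress question from data it already holds, while the CSN snapshot has to consult the shared map for every xid it judges. That per-xid lookup is the local cost being discussed here.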
> Using CSN-based snapshots solves these problems and also fits nicely
> with our generally WAL-centric architecture.
>
>
> Constraints on CSN choice for new snapshots:
>
> a. CSN must be greater than or equal to the CSN of any previously
> acquired snapshot.
... any previously acquired snapshot which was published before we
started acquiring a snapshot.
It should be fine to piggy-back on another backend's snapshot
acquisition as long as nobody else has published that they know of a
newer snapshot before we started getting that new snapshot. But that's
getting into optimizations.
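A minimal sketch of that relaxed form of constraint (a), assuming a single shared "newest published CSN" value. The `CsnBoard` class and the `acceptable` helper are hypothetical names, and the model is single-threaded, so it says nothing about the locking or atomics a real implementation would need.

```python
class CsnBoard:
    """Toy stand-in for shared memory holding the newest published snapshot CSN."""

    def __init__(self):
        self.published = 0

    def begin(self):
        # The floor for a new snapshot is whatever was published
        # before this backend started acquiring it.
        return self.published

    def publish(self, csn):
        # Published CSNs never move backwards.
        self.published = max(self.published, csn)
        return self.published


def acceptable(candidate_csn, floor):
    """A snapshot CSN is usable if it is not older than the floor seen at begin()."""
    return candidate_csn >= floor


board = CsnBoard()
board.publish(100)

floor_a = board.begin()   # backend A starts acquiring: floor is 100
board.publish(105)        # backend B finishes and publishes CSN 105 meanwhile
a_can_reuse_b = acceptable(105, floor_a)

floor_c = board.begin()   # backend C starts after B published: floor is 105
c_can_use_stale = acceptable(100, floor_c)
```

Backend A may piggy-back on B's CSN 105 because nothing newer than 100 had been published when A started, while backend C, which started after 105 was published, may not fall back to 100.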
[...]
> Safety of using a CSN-based snapshot:
>
> The commit record is written before CLOG is updated, so there's a
> window where we could assign a CSN and the CLOG hasn't been
> updated. We either need to wait until the CLOG update happens before
> using the snapshot, or change the way CLOG and/or visibility checks
> work to avoid that problem. Existing implementations in [1] do the
> latter.
>
>
> Proposal for Question 1:
>
> I propose that we choose the CSN to be the minimum CSN that satisfies
> constraint (c): the highest commit LSN of all the transactions that
> have released locks. That offers a useful guarantee: because the last
> transaction in the snapshot has already released locks, it has met its
> own WAL durability requirement, and all other transactions in the
> snapshot have met the same durability requirement (though not
> necessarily their *own* durability requirement).
True, but this means that the effects of a transaction with a higher
durability requirement are (or can be) published before that
requirement is actually met. Or, that transactions with a lesser
durability expectation will have to start waiting for transactions with
a higher durability expectation to become durable. I don't think
either of these options is a good one.
It'll also make transaction-local GUC values for
synchronous_commit much less safe.
> In the discussion at [1], some proposals used the WAL insert pointer
> to assign the CSN, but that doesn't seem to have any advantage, and
> makes it harder to come up with a satisfactory answer for question #2.
>
>
> Proposal for Question #2:
>
> Currently, for the procarray-based snapshots, a transaction will only
> be included in the snapshot if it meets the durability requirement of
> the transaction *writing the changes*. So, a sync rep transaction can
> see the effects of a sync=local transaction before it's replicated,
> and a sync=local transaction can see the effects of an async
> transaction before it's flushed. (Arguably this is a bug.)
>
> Instead, I propose that we wait until the CSN is flushed to the point
> that it meets the durability requirements of the transaction *using
> the snapshot* rather than the transaction *writing the changes*. To
> me, that would be the least-surprising behavior.
>
> If the workload uses a consistent durability requirement, then no
> waiting will be required if we assign the CSN as proposed in the
> answer to question #1. But it does risk regressions for workloads with
> mixed durability requirements.
>
> For instance, let's say that some sync transaction T1 that writes a
> commit record with LSN 122 but has not flushed yet. Then there's an
> async transaction T2 that writes a commit record with LSN 123, then
> finishes and releases locks. Then sync transaction T3 takes a
> snapshot, which must include T2 due to constraint (c), and therefore
> also includes T1 because the commit LSN is less than that of T2. If T3
> uses the snapshot right away, that would mean a sync transaction T3 is
> reading the changes of another sync transaction T1 that hasn't flushed
> yet, which is broken. The only solution is for T3 to wait at least
> until LSN 122 is flushed, which could add latency to the sync reader
> transactions.
That is the only solution unless visibility is determined (and logged)
separately from durability. In that case it could be made to work by
requiring transactions with higher durability expectations to log
their visibility, or by offloading that visibility-logging job to a
background process.
It doesn't even have to be very fancy: include the durability
requirement in the commit record, and then log the 3 commit
durability horizons every once in a short while for as long as there
are commits waiting for durability (and actually becoming durable).
Such a "durability horizons" record should be sufficient as the CSN
(visibility horizon) for the differently-durable commits that were
logged ahead of their respective durability horizon.
It'd be comparable to the commit_delay system that combines commits'
fsyncs, but with an additional XLog record getting logged.
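As a rough illustration of how such horizon records could behave, here is a toy model replaying the T1/T2 example from the quoted text. It is only one possible reading of the idea, under stated assumptions: the level names ("async", "local", "remote"), the record layout and the function names are all hypothetical, and every durability level, including async, is treated uniformly as becoming visible through a horizons record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Commit:
    name: str
    commit_lsn: int
    level: str                         # durability requirement stored in the commit record
    visible_lsn: Optional[int] = None  # LSN of the horizons record that made it visible


def apply_horizons_record(record_lsn, horizons, commits):
    """Process one hypothetical 'durability horizons' record.

    horizons maps each durability level to the LSN up to which that level
    is already satisfied. A pending commit becomes visible at record_lsn
    once its own level's horizon has reached its commit LSN.
    """
    for c in commits:
        if c.visible_lsn is None and c.commit_lsn <= horizons[c.level]:
            c.visible_lsn = record_lsn


def visible(commit, snapshot_csn):
    """Visibility is decided by the horizons record's LSN, not the commit LSN."""
    return commit.visible_lsn is not None and commit.visible_lsn <= snapshot_csn


# T1: sync commit at LSN 122, not flushed yet. T2: async commit at LSN 123.
t1 = Commit("T1", 122, "local")
t2 = Commit("T2", 123, "async")
commits = [t1, t2]

# Horizons record at LSN 124: the local flush has only reached LSN 121.
apply_horizons_record(124, {"async": 123, "local": 121, "remote": 121}, commits)
seen_at_124 = [c.name for c in commits if visible(c, 124)]

# Later the flush passes LSN 122; the next horizons record is at LSN 125.
apply_horizons_record(125, {"async": 124, "local": 123, "remote": 121}, commits)
seen_at_125 = [c.name for c in commits if visible(c, 125)]
```

In this model a snapshot with CSN 124 includes T2 but not T1 and never waits for a flush, even though T1's commit record precedes T2's. T1 only appears to snapshots at or after LSN 125, once its own durability requirement has been met.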
> (With procarray-based snapshots, T1 would simply not be
> included in the snapshot, but that's impossible for CSN-based
> snapshots.)
>
> We can mitigate that for some cases, such as a mix of async writers
> and sync readers, by tracking additional information so that a sync
> transaction doesn't wait for a flush if the only unflushed
> transactions are async. But there will still be a worst-case workload:
> where a steady stream of async writing transactions force
> newly-assigned CSNs to be close to the insert pointer, along with
> less-frequent sync writing transactions, and frequent sync reading
> transactions that need to wait for the flushes to use the
> snapshots.
I don't think that trading 'locked snapshot acquisition' for
'synchronous durability waits in snapshot acquisition' is a good
trade.
On average it'll cause transactions in workloads with mixed durability
expectations (synchronous_commit values of off/local, off/remote_*,
local/remote_*) to have significantly worse latency than they
currently have. In a (likely) worst case of such a mixed workload, the
session with the higher durability requirement will have to wait for
the durability of every snapshot it takes (it's likely the CSN will be
closer to the latest record insert pointer than to the durability
horizon needed), and will also have to wait for the full durability of
its own commit record. On HEAD, by contrast, we only ever wait for the
locks on the procarray and [ignoring cancellation bugs in SyncRepWait]
for the durability of our own commit record (because every visible
commit's durability is already guaranteed before it becomes visible).
> That worst-case workload may be impossible to solve in the
> CSN world, so hopefully it's not a major problem.
An approach with commit-visibility records would fix this: transactions
are made visible by a separate visibility (or horizon) record whose LSN
is the transaction's CSN visibility threshold, which would allow us to
skip the synchronous waits for CSN durability with mixed sync and async
transactions.
[...]
> SERIALIZABLE:
>
> I didn't fully analyze the implications for SSI yet, but I believe it's
> compatible. Snapshots are still snapshots, and SSI can still detect
> conflicts the same way as before. The serializable order can be
> different from CSN order, which could be confusing, but that's an
> existing problem.
Do you have a reference where I can read up on this issue?
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)