| From: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
|---|---|
| To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
| Cc: | pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Commit Sequence Numbers and Visibility |
| Date: | 2026-06-11 08:28:53 |
| Message-ID: | CAEze2WgmnTkp9orLimBz_qDV0DG5tDG6UWxeq3KMeJbrV2cPHw@mail.gmail.com |
| Lists: | pgsql-hackers |
On Thu, 11 Jun 2026 at 02:50, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Wed, 2026-06-10 at 21:18 +0200, Matthias van de Meent wrote:
> > I think it's desirable that snapshots don't need to take special
> > care
> > around the durability of the transactions that are included in their
> > snapshot.
> > Async transactions may want to see sync transactions' durable data,
> > but probably prefer not to have to wait for that durability just
> > because its session logged its latest commit record after the sync
> > transaction did.
>
> For the problem sequence with CSNs:
>
> 1. sync transaction T1 writes commit record
> 2. async transaction T2 writes commit record
> 3. T2 releases locks
> 4. T3 takes a snapshot
> 5. T1's commit record is flushed
> 6. T1 releases locks
>
> there are three resolutions:
>
> a. force T3 to wait until T1's commit record is flushed before
> using the snapshot, slowing down the sync part of the
> workload; or
>
> b. force T2 to wait until all unflushed sync transactions with
> an earlier commit LSN are flushed before releasing locks
> (thereby making the above sequence impossible), slowing down
> the async part of the workload; or
>
> c. let T3 use the snapshot immediately, potentially returning
> unflushed data to the client.
>
> Perhaps none of those options is great for everyone, but we could allow
> users to select the behavior they want. That seems better than today,
> when any async transaction can cause sync transactions to start
> returning unflushed data to the client, and there's no way to prevent
> that.
Yes, "wait for data I'm reading to match my own durability
requirement" should be configurable.
But I'm not sure whether we want to do that wait at snapshot acquisition
time, or only when reading data that was modified by a transaction that
is not yet durable. The latter would be a very attractive optimization
for workloads with different durability expectations that touch
completely (or mostly) disjoint datasets in the same database.
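To make the difference between the two wait points concrete, here is a minimal sketch in plain Python. It is not PostgreSQL code: the `Commit` record, the `lsn_to_wait_for` helper, the table names, and the LSN values are all hypothetical, and it assumes we can tell which keys a visible transaction modified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    xid: str
    commit_lsn: int
    touched: frozenset  # keys modified by this transaction (hypothetical)


def lsn_to_wait_for(visible, flushed_lsn, read_keys=None):
    """Return the LSN a reader must wait for, or None if no wait is needed.

    read_keys=None models waiting at snapshot acquisition time: every
    visible but unflushed commit counts. Otherwise only unflushed commits
    that modified a key the reader actually touches are considered.
    """
    relevant = [
        c for c in visible
        if c.commit_lsn > flushed_lsn
        and (read_keys is None or c.touched & read_keys)
    ]
    return max((c.commit_lsn for c in relevant), default=None)


# T1 (already flushed) modified "accounts"; T2 (async, unflushed) modified "metrics".
t1 = Commit("T1", 123, frozenset({"accounts"}))
t2 = Commit("T2", 124, frozenset({"metrics"}))
visible = [t1, t2]
flushed = 123

snapshot_time = lsn_to_wait_for(visible, flushed)
read_time_accounts = lsn_to_wait_for(visible, flushed, frozenset({"accounts"}))
read_time_metrics = lsn_to_wait_for(visible, flushed, frozenset({"metrics"}))
```

In this sketch a reader waiting at snapshot time always waits for LSN 124, while a reader that only touches "accounts" never waits at all; only a reader of "metrics" pays for T2's missing flush.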
> > >
> > Visibility would presumably happen after the recovery/replica has
> > made
> > sure that the durability of the pending-visible transactions is
> > guaranteed; which presumably could be done with a sync wait at
> > end-of-recovery.
>
> That might be fine, but it's different from CSNs as I understand the
> meaning (at least in the simplest and most intuitive meaning). Should
> we call them "Visibility Sequence Numbers" (VSNs) instead?
I think that'd help clarify the difference between durability and visibility, yes.
> For instance, given a sequence like (T1-T3 all sync transactions):
>
> 1. T1 writes commit record at LSN 123
> 2. T2 writes commit record at LSN 124
> 3. Flush to LSN 124
> 4. T2 writes commit-visibility record at LSN 125
> 5. T3 takes a snapshot with VSN=125 and returns data to client that
> includes T2's changes but not T1's changes
> 6. crash
>
> Recovery does not have enough information to know whether T1 or T2
> became visible first, so we'd probably need to create a new "multi-
> commit-visible" record that makes them all visible at the same LSN.
> That would mean that the snapshot taken by T3 (that was externally
> observed) could never exist in the recovered system, even though both
> T1 and T2 exist.
Correct, but are we also targeting true snapshot transferability from
primary to replicas? I don't think we need to be able to recreate on
the replica every snapshot that could exist on the primary (we'd have
to consider CommandIDs), as long as every snapshot that can be created
on a replica could also have been created on the primary, which I think
is what this VSN approach is able to guarantee.
Note that I believe that there is no meaningful difference from a
consistency standpoint between T1 and T2 both becoming visible at the
same time (with the same VSN) during recovery, and the system not
having any snapshot acquired between their VSNs -- which is something
that could happen on replicas. Side effects of a transaction won't
appear in WAL before the VSN of that transaction, so there should be
no opportunity for ordering issues here.
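As a rough illustration of that argument, the sketch below enumerates the sets of transactions a snapshot could see, given each transaction's VSN. It is a toy model in Python, not PostgreSQL code: `reachable_snapshots` is a hypothetical helper, and VSN 126 for T1 on the primary is an assumed value (the quoted sequence crashes before T1 logs one).

```python
def reachable_snapshots(vsn_by_xid):
    """Return every distinct visible-set a snapshot could observe.

    Toy rule: a snapshot taken at LSN s sees xid iff its VSN <= s.
    """
    snaps = {frozenset()}  # snapshot taken before anything became visible
    for s in sorted(set(vsn_by_xid.values())):
        snaps.add(frozenset(x for x, v in vsn_by_xid.items() if v <= s))
    return snaps


# Primary: T2 becomes visible at 125, T1 later (126 is assumed).
primary = reachable_snapshots({"T2": 125, "T1": 126})

# Recovery/replica: both become visible at the same VSN.
replica = reachable_snapshots({"T2": 126, "T1": 126})
```

Here `replica` is a subset of `primary`: the replica loses the "T2 only" snapshot, but it never produces a snapshot the primary could not have produced.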
> That might have consequences for PITR. With recovery_target_xid, do you
> recover up to its commit record or its commit-visibility record?
Good point.
I suspect we'd have to make that a configurable option, but stopping at
the VSN (which would be the end of WAL if no VSN was logged for that
commit) would probably make the most sense, as it is least likely to
cause snapshot issues if further WAL is replayed afterward.
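A minimal sketch of that choice, in Python rather than actual recovery code: `recovery_stop_lsn` and its `stop_at` parameter are hypothetical names, not existing settings, and the end-of-WAL value of 130 is assumed for illustration.

```python
def recovery_stop_lsn(xid, commit_lsn, vsn, end_of_wal, stop_at="vsn"):
    """Pick the LSN at which recovery for a target xid would stop.

    stop_at="commit": stop at the transaction's commit record.
    stop_at="vsn":    stop at its commit-visibility record, falling back
                      to the end of WAL if no VSN was logged for it.
    """
    if stop_at == "commit":
        return commit_lsn[xid]
    if stop_at == "vsn":
        return vsn.get(xid, end_of_wal)
    raise ValueError(f"unknown stop_at value: {stop_at!r}")


commit_lsn = {"T1": 123, "T2": 124}
vsn = {"T2": 125}   # T1 never logged a commit-visibility record
end_of_wal = 130    # assumed value

stop_t2_commit = recovery_stop_lsn("T2", commit_lsn, vsn, end_of_wal, "commit")
stop_t2_vsn = recovery_stop_lsn("T2", commit_lsn, vsn, end_of_wal)
stop_t1_vsn = recovery_stop_lsn("T1", commit_lsn, vsn, end_of_wal)
```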
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)