| From: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
|---|---|
| To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
| Cc: | pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Commit Sequence Numbers and Visibility |
| Date: | 2026-06-10 19:18:16 |
| Message-ID: | CAEze2WjDMf6LX3WQa-95qNZca1eH1vtJOPNobMuMDO+NHzgDGw@mail.gmail.com |
| Lists: | pgsql-hackers |
On Wed, 10 Jun 2026 at 18:40, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Wed, 2026-06-10 at 14:19 +0200, Matthias van de Meent wrote:
> > Our current snapshot mechanism allows it to be a decision that the
> > committing transaction makes (+ edge cases related to cancellation
> > etc.);
>
> That's true, but I thought that was considered to be incidental and
> undesirable behavior.
I think it's desirable that snapshots don't need to take special care
around the durability of the transactions that they include.
Async transactions may want to see sync transactions' durable data,
but probably prefer not to have to wait for that durability just
because their session logged its latest commit record after the sync
transaction did.
> >
> >
> > visibility is determined (and logged)
> > separately from durability,
>
> That stretches the definition of "CSN". We'd have to redefine "commit"
> to mean "when the transaction writes the commit-visibility record". But
> crash recovery and failover would be using a different definition?
Visibility would presumably happen after the recovery/replica has made
sure that the durability of the pending-visible transactions is
guaranteed, which could likely be done with a sync wait at
end-of-recovery.
> > in which case it could be made to work by
> > requiring transactions with higher durability expectations to log
> > their visibility, or offloading that visibility-logging job to a
> > background process.
> >
> > It doesn't even have to be very fancy; including the durability
> > requirement in the commit record, and then logging the 3 commit
> > durability horizons every once in a short while for as long as there
> > are commits waiting for durability (and actually become durable)
> > should be sufficient - such a "durability horizons" record should be
> > sufficient as CSN (visibility horizon) for the differently-durable
> > commits that were logged ahead of their respective durability
> > horizon.
> > It'd be comparable to the commit_delay system that combines commits'
> > fsyncs, but with an additional XLog record getting logged.
>
> If I understand your proposal:
>
> Naively, each committing transaction can write an additional commit-
> visibility record right before ProcArrayEndTransaction(), but that
> could be expensive.
Correct. Not all that much more expensive, but for unlogged workloads
it could double the number of WAL records emitted.
> You are suggesting an optimization to infer the visibility LSN of a
> given transaction from its commit record (including an extra durability
> requirement setting) along with some other records in the WAL that are
> written less often than every commit. Can you describe in more detail
> what will be logged and how you make the inference?
I've thought about these different approaches:
0.) synchronous_commit=off doesn't care about durability and doesn't
need to wait to release its locks; and is excluded from consideration
in the systems below.
The commit record itself would be used to supply the CSN visibility threshold.
1.) Each commit for itself, logging its own visibility record
when it achieves durability.
Lowest per-commit durability latency, but higher pressure on WAL
insertions. The visibility record's LSN would be the CSN threshold for
the visibility of this XID, and doesn't need a separate durability
guarantee - recovery has rules to recover, apply, and make visible
those yet-to-become-visible transactions.
1a.) A batched approach to 1, where one backend gathers sibling
committers (like in commit_delay) with the same durability
expectations for a single visibility record that includes the sibling
commits' IDs; they all become visible once that record's logged.
This reduces WAL pressure, at the cost of slightly increased latency
(the oldest xact will have to wait for the newest xact in the batch to
also become durable).
2.) A BGWorker approach, where each backend includes its durability
requirement (sync/remote_write/remote_apply) in the commit record, and
a bgworker logs records every (e.g.) commit_delay that include the
(monotonically increasing) LSNs of the various durability horizons, if
any commits were waiting for that horizon (sync/rw/ra). This limits
WAL overhead to one record per commit_delay, at the cost of each commit
incurring a latency penalty of up to commit_delay past its durability
window, and of less fine-grained snapshot boundaries.
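To make the inference in the BGWorker approach more concrete, here is a minimal sketch in Python (not PostgreSQL code). It assumes that a commit is identified by the end LSN of its commit record, and that a "durability horizons" record carries one LSN per durability level that had waiters. The names CommitRecord, HorizonsRecord and visibility_lsn are hypothetical, made up for this illustration. A commit becomes visible at the LSN of the first horizons record whose horizon for the commit's level has reached the commit's LSN; synchronous_commit=off commits use their own commit record, as in approach 0.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CommitRecord:
    xid: int
    lsn: int         # end LSN of the commit record (assumed reference point)
    durability: str  # "off", "sync", "remote_write" or "remote_apply"


@dataclass(frozen=True)
class HorizonsRecord:
    lsn: int                  # end LSN of this horizons record
    horizons: Dict[str, int]  # durability level -> LSN up to which commits are durable


def visibility_lsn(commit: CommitRecord,
                   horizon_records: List[HorizonsRecord]) -> Optional[int]:
    """Return the LSN at which `commit` becomes visible, or None if not yet."""
    if commit.durability == "off":
        # Approach 0: the commit record itself supplies the threshold.
        return commit.lsn
    for rec in horizon_records:  # expected in WAL (LSN) order
        # A record without an entry for this level says nothing about it.
        if rec.horizons.get(commit.durability, -1) >= commit.lsn:
            return rec.lsn
    return None


# Illustrative WAL contents; all LSNs are made-up integers.
horizon_records = [
    HorizonsRecord(lsn=1300, horizons={"sync": 1250, "remote_apply": 900}),
    HorizonsRecord(lsn=1500, horizons={"sync": 1450, "remote_apply": 1150}),
]
commits = [
    CommitRecord(xid=100, lsn=1000, durability="sync"),
    CommitRecord(xid=101, lsn=1100, durability="remote_apply"),
    CommitRecord(xid=102, lsn=1200, durability="off"),
    CommitRecord(xid=103, lsn=1600, durability="sync"),
]
for c in commits:
    print(c.xid, visibility_lsn(c, horizon_records))
```

In this toy WAL, xid 100 becomes visible at 1300, xid 101 only at 1500 (the first record's remote_apply horizon had not yet passed it), xid 102 at its own commit LSN, and xid 103 is still waiting for a horizons record.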
> Also, how do you choose the CSN when acquiring a snapshot? I assume
> just the insert pointer?
Yes, that, or the record-end pointer of (e.g.) the most recently
inserted visibility record.
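As a rough illustration of how such a snapshot could then be used, the sketch below (Python, with hypothetical names snapshot_csn and is_visible) picks the CSN from either the insert pointer or the end of the latest visibility record, and treats an XID as visible when its visibility LSN is at or below that CSN. The mapping from XID to visibility LSN is assumed to be available somehow; how it would be stored is not covered here.

```python
from typing import Dict, Optional


def snapshot_csn(insert_lsn: int,
                 last_visibility_end: Optional[int] = None) -> int:
    """Pick a snapshot CSN: the end of the most recent visibility record
    when one is known, otherwise the current WAL insert pointer."""
    return insert_lsn if last_visibility_end is None else last_visibility_end


def is_visible(xid: int, csn: int, visibility_lsns: Dict[int, int]) -> bool:
    """An XID is visible if its visibility LSN exists and is <= the CSN."""
    vis = visibility_lsns.get(xid)
    return vis is not None and vis <= csn


# Made-up state: xid 100 became visible at LSN 1300, xid 101 at LSN 1500;
# xid 102 has committed but has no visibility LSN yet.
visibility_lsns = {100: 1300, 101: 1500}

csn = snapshot_csn(insert_lsn=1400)
print(is_visible(100, csn, visibility_lsns))  # True
print(is_visible(101, csn, visibility_lsns))  # False
print(is_visible(102, csn, visibility_lsns))  # False
```

Both choices of CSN give the same answer for any XID whose visibility is carried by a visibility record, since no such record can end between the latest one and the insert pointer.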
> > [...]
> > > SERIALIZABLE:
> > >
> > > I didn't fully analyze the implications for SSI yet, but I believe
> > > it's
> > > compatible. Snapshots are still snapshots, and SSI can still detect
> > > conflicts the same way as before. The serializable order can be
> > > different from CSN order, which could be confusing, but that's an
> > > existing problem.
> >
> > Do you have a reference where I can read up on this issue?
>
> https://arxiv.org/pdf/1208.4179
Thanks!
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)