| From: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
|---|---|
| To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
| Cc: | pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Commit Sequence Numbers and Visibility |
| Date: | 2026-06-12 10:03:55 |
| Message-ID: | CAEze2Wh7L1ONOEsSf7KY40N0i1ZCrk14W5b=cGqBsw+mGXPEdQ@mail.gmail.com |
| Lists: | pgsql-hackers |
On Fri, 12 Jun 2026 at 00:24, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> I see "Visibility Sequence Numbers" as a somewhat-unfortunate
> compromise. VSNs get the right information in WAL to keep the primary
> and replica consistent, which is certainly good; but they leave us with
> visibility semantics that don't match the commit order, which is a
> source of confusion.
>
> If we have the right set of configuration knobs as suggested above (to
> handle mixed async/sync workloads), do you still think we should pursue
> VSNs over CSNs?
Maybe we should add VSNs as a fourth option in the list, as they keep
PG's current visibility ordering mostly intact, except in the case of
recovery, where all transactions might become visible all at once.
That is not something I'm particularly worried about, for the reasons
I mentioned in my mail yesterday.
Current snapshot semantics in PG are quite nice [ignoring cancellation
bugs, and replicas] for workloads with mixed durability requirements,
in that visibility == transaction's configured durability, and no
snapshot has to spend its own time waiting for a transaction's
durability expectation, unless it is explicitly blocked by that
transaction's locks.
> > Correct, but are we also targeting true snapshot transferability from
> > primary to replicas?
>
> That would be a significant benefit of CSNs over VSNs. There would
> still be some details to work out about timelines, and maybe other
> things I haven't considered, but CSNs get us a lot closer.
CommandIds are a big issue with full RW snapshot transfers. RO
snapshot transfers would be easy with both CSN and VSN, in both cases
assuming that both systems are using the same timeline; it's just that
for those transfers you may have to add some not-yet-durable ordering
data (or wait for the VSN to arrive on the replica).
> > Note that I believe that there is no meaningful difference from a
> > consistency standpoint between T1 and T2 both becoming visible at the
> > same time (with the same VSN) during recovery, and the system not
> > having any snapshot acquired between their VSNs -- which is something
> > that could happen on replicas. Side effects of a transaction won't
> > appear in WAL before the VSN of that transaction, so there should be
> > no opportunity for ordering issues here.
>
> You're correct that VSNs eliminate inconsistencies between the primary
> and the replica (or failed-over/recovered system). If there's some
> snapshot on the primary with T2 but not T1, there won't be a snapshot
> somewhere else with T1 but not T2.
>
> But we will have collapsed T1 and T2 into a single visibility event,
> which is technically a loss of information, and we'd need to sort
> through the nuances and implications.
I don't think there are many nuances. If the replica/recovery doesn't
have the VSN of certain transactions yet, then there also can't be any
effects in the database that depend on the VSN ordering of those
transactions: locks are only released after the VSN is assigned, so
any persistent VSN ordering dependencies can only appear after the
VSN.
And, for external readers, at worst, they won't be able to see or know
the intermediate snapshot ordering they could've seen on the primary
before the server crash/failover/recovery point, but after the last
durable WAL was logged. I don't think we should rack our brains over
that: every transaction that was visible on the primary and had its
(sufficiently high) durability guarantee satisfied will still be visible
after recovery/promotion.
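To make the consistency argument concrete, here is a minimal sketch, not
anything from a patch. It models a snapshot as the set of transactions
visible at some point, and a "visibility event" as the set of
transactions that become visible together. The function names and the
T1/T2 orderings are purely illustrative assumptions.

```python
def snapshots(visibility_events):
    """Return every snapshot (set of visible xids) that could be taken
    before, between, or after the given visibility events."""
    result = [frozenset()]
    for event in visibility_events:
        result.append(result[-1] | frozenset(event))
    return result


def conflicting(a, b):
    """Two snapshots conflict if each sees a transaction the other does not."""
    return bool(a - b) and bool(b - a)


def conflicts(left, right):
    return [(a, b) for a in left for b in right if conflicting(a, b)]


# Assumed primary history: T1 and T2 get distinct VSNs, T1 first.
primary = snapshots([{"T1"}, {"T2"}])

# Assumed recovered system: neither VSN made it into durable WAL, so both
# transactions become visible in a single event.
recovered = snapshots([{"T1", "T2"}])

# For contrast: a system that made T2 visible before T1.
reordered = snapshots([{"T2"}, {"T1"}])

collapsed_conflicts = conflicts(primary, recovered)
reordered_conflicts = conflicts(primary, reordered)

# Snapshots that existed on the primary but can no longer be observed.
lost = set(primary) - set(recovered)
```

In this model the collapsed history yields no conflicting snapshot pair;
it only loses the intermediate {T1} snapshot. The reordered history does
produce a conflict ({T1} on one side, {T2} on the other), which is the
kind of inconsistency the quoted text says VSNs rule out.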
> Let recovery_target_xid=T1 and let the WAL contents be:
>
> LSN 122 TL1: commit record for T0
> LSN 123 TL1: commit record for T1
> LSN 124 TL1: commit record for T2
> LSN 125 TL1: commit-visible record for T1
> LSN 126 TL1: commit-visible record for T0
> LSN 127 TL1: whatever
> LSN 128 TL1: commit-visible record for T2
>
> If you recover to the commit-visible record (LSN=125), then T0's and
> T2's commit records have been replayed, but not their commit-visible
> record, so you must write the multi-visible record:
>
> LSN 126 TL2: multi-visible record for {T0, T2}
>
> before startup. That means you have effectively recovered to T2, not
> T1. That's wrong, therefore we must define recovery based on commit
> records.
>
> If you recover to the commit record (LSN=123) instead, then T0's and
> T1's commit records have been replayed, but not their commit-visible
> records, so you must write a multi-visible record:
>
> LSN 124 TL2: multi-visible record for {T0, T1}
>
> which loses the information that T1 became visible before T0. That
> might be acceptable, but CSNs just seem a lot simpler.
It's probably simpler, yes, but the VSN approach does resolve my concerns about
i.) commit throughput in light of the waits introduced to every
backend by options (a) and (b), and
ii.) the lack of a commit durability guarantee for visible data in (c).
Adjusting (c) with durability waits when the snapshot encounters data
modified by not-yet-fully-durable transactions would just add this
option to the 'newly introduced waits' concern list, too.
None of that would be a concern if we only allowed a global
synchronous_commit setting, but that's not the world we live in, nor a
world we're planning to live in (I think), so these concerns do worry
me.
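For reference, the recovery scenario quoted above can be replayed in a
small sketch. The record kinds ("commit", "commit-visible",
"multi-visible") and the LSNs come from the quoted example; the tuple
layout and the recover() helper are assumptions made for illustration
and do not correspond to any existing WAL format or PostgreSQL function.

```python
# (lsn, record kind, xid) on timeline 1, as in the quoted example.
WAL = [
    (122, "commit", "T0"),
    (123, "commit", "T1"),
    (124, "commit", "T2"),
    (125, "commit-visible", "T1"),
    (126, "commit-visible", "T0"),
    (127, "other", None),
    (128, "commit-visible", "T2"),
]


def recover(wal, stop_lsn):
    """Replay records up to and including stop_lsn.

    Returns (visible, multi_visible):
      visible       - xids made visible one at a time during replay, in order
      multi_visible - xids whose commit record was replayed but whose
                      commit-visible record was not; they would all be
                      covered by one multi-visible record at end of recovery
    """
    committed, visible = [], []
    for lsn, kind, xid in wal:
        if lsn > stop_lsn:
            break
        if kind == "commit":
            committed.append(xid)
        elif kind == "commit-visible":
            visible.append(xid)
    multi_visible = sorted(x for x in committed if x not in visible)
    return visible, multi_visible


# Stop at T1's commit-visible record (LSN 125).
at_visible = recover(WAL, 125)

# Stop at T1's commit record (LSN 123).
at_commit = recover(WAL, 123)
```

Stopping at LSN 125 leaves T0 and T2 for the multi-visible record, so T2
ends up visible even though the target was T1. Stopping at LSN 123 puts
T0 and T1 into the same multi-visible record, which drops the fact that
T1 became visible before T0 on the original timeline.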
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)