Global snapshots

From: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Global snapshots
Date: 2018-05-01 16:27:22
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Global snapshots

Here proposed a set of patches that allow achieving proper snapshot isolation
semantics in the case of cross-node transactions. Provided infrastructure to
synchronize snapshots between different Postgres instances and a way to
atomically commit such transaction with respect to other concurrent global and
local transactions. Such global transactions can be coordinated outside of
Postgres by using provided SQL functions or through postgres_fdw, which make use
of this functions on remote nodes transparently.


Several years ago was proposed extensible transaction manager (XTM) patch
[1,2,3] which allowed extensions to hook and override transaction-related
functions like a xid assignment, snapshot acquiring, transaction tree
commit/abort, etc. Also, two transaction management extensions were created
based on that API: pg_dtm, pg_tsdtm [4].

The first one, pg_dtm, was inspired by Postgres-XL and represents a direct
generalization of Postgres MVCC to the cross-node scenario: there is standalone
service (DTM) that maintains a xid counter and an array of running backends on
all nodes. Every node keeps an open connection to a DTM and asks for xids and
Postgres-style snapshots by a network call. While this approach is reasonable
for low transaction rates (which is common for OLAP-like load) it quickly gasps
for OLTP case.

The second one, pg_tsdtm, based on Clock-SI paper [5] implements distributed
snapshot synchronization protocol without the necessity of central service like
DTM. It makes use of CSN-based visibility and requires two steps of
communication during transaction commit. Since commit between nodes each of
which can abort transaction is usually done by 2PC protocol two steps of
communication are anyway already needed.

The current patch is a reimplementation of pg_tsdtm moved into core directly
without using any API. The decision to drop XTM API was based on two thoughts.
At first XTM API itself needed a huge pile of work to be done to became an
actual API instead of the set of hooks over current implementation of MVCC with
extensions responsible for handling random static variables like
RecentGlobalXmin and friend, which other guts make use of. For the second in all
of our test pg_tsdtm was better and faster, but that wasn't obvious from the
beginning of work on XTM/DTM. So we decided that it is better to focus on the
good implementation of Clock-SI approach.


Clock-SI is described in [5] and here I provide a small overview, which
supposedly should be enough to catch the idea. Assume that each node runs Commit
Sequence Number (CSN) based visibility: database tracks one counter for each
transaction start (xid) and another counter for each transaction commit (csn).
In such setting, a snapshot is just a single number -- a copy of current CSN at
the moment when the snapshot was taken. Visibility rules are boiled down to
checking whether current tuple's CSN is less than our snapshot's csn. Also it
worth of mentioning that for the last 5 years there is an active proposal to
switch Postgres to CSN-based visibility [6].

Next, let's assume that CSN is current physical time on the node and call it
GlobalCSN. If the physical time on different nodes would be perfectly
synchronized then such snapshot exported on one node can be used on other nodes.
But unfortunately physical time never perfectly sync and can drift, so that fact
should be taken into mind. Also, there is no easy notion of lock or atomic
operation in the distributed environment, so commit atomicity on different nodes
with respect to concurrent snapshot acquisition should be handled somehow.
Clock-SI addresses that in following way:

1) To achieve commit atomicity of different nodes intermediate step is
introduced: at first running transaction is marked as InDoubt on all nodes,
and only after that each node commit it and stamps with a given GlobalCSN.
All readers who ran into tuples of an InDoubt transaction should wait until
it ends and recheck visibility. The same technique was used in [6] to achieve
atomicity of subtransactions tree commit.

2) When coordinator is marking transactions as InDoubt on other nodes it
collects ProposedGlobalCSN from each participant which is local time at that
nodes. Next, it selects the maximal value of all ProposedGlobalCSN's and
commits the transaction on all nodes with that maximal GlobaCSN. Even if that
value is greater than current time on this node due to clock drift. So the
GlobalCSN for the given transaction will be the same on all nodes.

3) When local transaction imports foreign global snapshot with some GlobalCSN
and current time on this node is smaller then incoming GlobalCSN then the
transaction should wait until this GlobalCSN time comes on the local clock.

Rules 2) and 3) provide protection against time drift. Paper [5] proves that
this is enough to guarantee a proper Snapshot Isolation.


Main design decision of this patch is trying to affect the performance of local
transaction as low as possible while providing a way to make global
transactions. GUC variable track_global_snapshots enables/disables this feature.

Patch 0001-GlobalCSNLog introduces new SLRU instance that maps xids to
GlobalCSN. GlobalCSNLog code is pretty straightforward and more or less copied
from SUBTRANS log which is also not persistent. Also, I kept an eye on the
corresponding part of Heikki's original patch for CSN's in [6] and commit_ts.c.

Patch 0002-Global-snapshots provides infrastructure to snapshot sync functions
and global commit functions. It consists of several parts which would be enabled
when GUC track_global_snapshots is on:

* Each Postgres snapshot acquisition is accompanied by taking current GlobalCSN
under the same shared ProcArray lock.

* Each transaction commit also writes current GlobalCSN into GlobalCSNLog. To
avoid writes to SLRU under exclusive ProcArray lock (which would be the major
hit on commit performance) trick with intermediate InDoubt state is used:
before calling ProcArrayEndTransaction() backend writes InDoubt state in SLRU,
then inside of ProcArrayEndTransaction() under a ProcArray lock GlobalCSN is
assigned, and after the lock is released assigned GlobalCSN value is written
to GlobalCSNLog SLRU. This approach ensures XIP-based snapshots and
GlobalCSN-based are going to see the same subset of tuple versions without
putting too much extra contention on ProcArray lock.

* XidInMVCCSnapshot can use both XIP-based and GlobalCSN-based snapshot. If the
current transaction is local one then at first XIP-based check is performed,
then if the tuple is visible don't do any further checks; if this xid is
in-progress we need to fetch GlobalCSN from SLRU and recheck it with
GlobalCSN-based visibility rules, as it may be part of global InDoubt
transaction. So, for local transactions XidInMVCCSnapshot() will fetch SLRU
entry only for in-progress transactions. This can be optimized further: it is
possible to store a flag in a snapshot which indicates whether there were any
active global transactions when this snapshot was taken. Then if there were no
global transactions during snapshot creation SLRU access in
XidInMVCCSnapshot() can be skipped at all. Otherwise, if current backend have
the snapshot that was imported from foreign node then we use only
GlobalCSN-based visibility rules (as we don't have any information about how
XIP array looked like when that GlobalCSN was taken).

* Import/export routines provided for global snapshots. Export just returns
current snapshot's global_csn. Import, on the other hand, is more complicated.
Since imported GlobalCSN usually points in past we should hold OldestXmin and
ensure that tuple versions for given GlobalCSN don't pass through OldestXmin
boundary. To achieve that, mechanism like one in SnapshotTooOld is created: on
each snapshot creation current oldestXmin is written to a sparse ring buffer
which holds oldestXmin entries for a last global_snapshot_defer_time seconds.
GetOldestXmin() is delaying its results to the oldest entry in this ring
buffer. If we asked to import snapshot that is later then current_time -
global_snapshot_defer_time then we just error out with "global snapshot too
old" message. Otherwise, we have enough info to set proper xmin in our proc
entry to defuse GetOldestXmin().

* Following routines for commit provided:
* pg_global_snaphot_prepare(gid) sets InDoubt state for given global tx and
return proposed GlobalCSN.
* pg_global_snaphot_assign(gid, global_csn) assign given global_csn to given
global tx. Consequent COMMIT PREPARED will use that.

Import/export and commit routines are made as SQL functions. IMO it's better to
GLOBAL / COMMIT GLOBAL PREPARED. But at this stage, I don't want to clutter this
patch with new grammar.

Patch 0003-postgres_fdw-support-for-global-snapshots uses the previous
infrastructure to achieve isolation in the generated transaction. Right now it
is a minimalistic implementation of 2PC like one in [7], but which doesn't care
about writing something about remote prepares in WAL on coordinator and doesn't
care about any recovery. The main usage is to test things in global snapshot
patch, as it easier to do with TAP tests over several postgres_fdw-connected
instances. Tests are included.


The distributed transaction can be coordinated by an external coordinator. In
this case normal scenario would be following:

1) Start transaction on the first node, do some queries if needed, call

2) On the other node open transaction and call
pg_global_snaphot_import(global_csn). global_csn is from previous export.

... do some useful work ...

3) Issue PREPARE for all participant.
4) Call pg_global_snaphot_prepare(gid) on all participant and store returned

select maximal over of returned global_csn's.

5) Call pg_global_snaphot_assign(gid, max_global_csn) on all participant.
6) Issue COMMIT PREPARED for all participant.

As it was said earlier steps 4) and 5) can be melded into 3) and 5)
respectively, but let's go for grammar changes after and if there will be
agreement on the overall concept.

postgres_fdw in 003-GlobalSnapshotsForPostgresFdw does the same thing, but
transparently to the user.


Each XidInMVCCSnapshot() in this patch is coated with assert that checks that
same things are visible under XIP-based check and GlobalCSN one.

Patch 003-GlobalSnapshotsForPostgresFdw includes numerous variants of banking
test. It spin-offs several Postgres instances and several concurrent pgbenches
which simulate cross-node bank account transfers, while test constantly checks
that total balance is correct. Also, there are tests that imported snapshot is
going to see the same checksum of data as it was during import.

I think this feature also deserves separate test module that will wobble the
clock time, but that isn't done yet.

Original XTM, pg_dtm, pg_tsdtm were written by Konstantin Knizhnik, Constantin
Pan and me.
This version is mostly on me, with some inputs by Konstantin
Knizhnik and Arseny Sher.

(originally thread was about fdw-based sharding, but later got "hijacked" by xtm)

Attachment Content-Type Size
0001-GlobalCSNLog-SLRU.patch application/octet-stream 24.0 KB
0002-Global-snapshots.patch application/octet-stream 63.6 KB
0003-postgres_fdw-support-for-global-snapshots.patch application/octet-stream 35.9 KB
unknown_filename text/plain 3 bytes


Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2018-05-01 16:30:44 Re: Local partitioned indexes and pageinspect
Previous Message Andres Freund 2018-05-01 16:26:30 Re: Usage of pg_waldump