Re: Global snapshots

From: Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>
To: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Global snapshots
Date: 2018-07-25 11:35:02
Message-ID: 874lgna6qh.fsf@ars-thinkpad
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

I have looked through the patches and found them pretty accurate. I'd
fixed a lot of small issues here and there; updated patchset is
attached. But first, some high-level notes:

* I agree that it would be cool to implement functionality like current
"snapshot too old": that is, abort transaction with old global
snapshot only if it really attempted to touch modified data.

* I also agree with Stas that any attempts to trade oldestxmin in
gossip between the nodes would drastically complicate this patch and
make it discussion-prone; it would be nice first to get some feedback
on general approach, especially from people trying to distribute
Postgres.

* One drawback of these patches is that only REPEATABLE READ is
supported. For READ COMMITTED, we must export every new snapshot
generated on coordinator to all nodes, which is fairly easy to
do. SERIALIZABLE will definitely require chattering between nodes,
but that's much less demanded isolevel (e.g. we still don't support
it on replicas).

* Another somewhat serious issue is that there is a risk of recency
guarantee violation. If client starts transaction at node with
lagging clocks, its snapshot might not include some recently
committed transactions; if client works with different nodes, she
might not even see her own changes. CockroachDB describes at [1] how
they and Google Spanner overcome this problem. In short, both set
hard limit on maximum allowed clock skew. Spanner uses atomic
clocks, so this skew is small and they just wait it at the end of
each transaction before acknowledging the client. In CockroachDB, if
tuple is not visible but we are unsure whether it is truly invisible
or it's just the skew (the difference between snapshot and tuple's
csn is less than the skew), transaction is restarted with advanced
snapshot. This process is not infinite because the upper border
(initial snapshot + max skew) stays the same; this is correct as we
just want to ensure that our xact sees all the committed ones before
it started. We can implement the same thing.

Now, the description of my mostly cosmetical changes:

* Don't ERROR in BroadcastStmt to allow us to handle failure manually;
* Check global_snapshot_defer_time in ImportGlobalSnapshot instead of
falling on assert;
* (Arguably) improved comments around locking at circular buffer
maintenance; also, don't lock procarray during global_snapshot_xmin
bump.
* s/snaphot/snapshot, other typos.
* Don't track_global_snapshots by default -- while handy for testing, it
doesn't look generally good.
* Set track_global_snapshots = true in tests everywhere.
* GUC renamed from postgres_fdw.use_tsdtm to
postgres_fdw.use_global_snapshots for consistency.
* 003_bank_shared.pl test is removed. In current shape (loading one
node) it is useless, and if we bombard both nodes, deadlock surely
appears. In general, global snaphots are not needed for such
multimaster-like setup -- either there are no conflicts and we are
fine, or there is a conflict, in which case we get a deadlock.
* Fix initdb failure with non-zero global_snapshot_defer_time.
* Enforce REPEATABLE READ since currently we export snap only once in
xact.
* Remove assertion that circular buffer entries are monotonic, as
GetOldestXmin *can* go backwards.

[1] https://www.cockroachlabs.com/blog/living-without-atomic-clocks/

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
0001-GlobalCSNLog-SLRU-v2.patch text/x-diff 24.0 KB
0002-Global-snapshots-v2.patch text/x-diff 64.6 KB
0003-postgres_fdw-support-for-global-snapshots-v2.patch text/x-diff 32.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey Klychkov 2018-07-25 11:46:14 Fwd: Re[2]: Alter index rename concurrently to
Previous Message Sergei Kornilov 2018-07-25 10:42:10 Re: Online enabling of checksums