Quick Links

Re: Global snapshots

From:	Arseny Sher <a(dot)sher(at)postgrespro(dot)ru>
To:	Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc:	Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Global snapshots
Date:	2018-09-26 15:02:47
Message-ID:	87pnx0fgiw.fsf@ars-thinkpad
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hello,

Andrey Borodin <x4mmm(at)yandex-team(dot)ru> writes:

> I like the idea that with this patch set universally all postgres
> instances are bound into single distributed DB, even if they never
> heard about each other before :) This is just amazing. Or do I get
> something wrong?

Yeah, in a sense of xact visibility we can view it like this.

> I've got few questions:
> 1. If we coordinate HA-clusters with replicas, can replicas
> participate if their part of transaction is read-only?

Ok, there are several things to consider. Clock-SI as described in the
paper technically boils down to three things. First two assume CSN-based
implementation of MVCC where local time acts as CSN/snapshot source, and
they impose the following additional rules:

1) When xact expands on some node and imports its snapshot, it must be
blocked until clocks on this node will show time >= snapshot being
imported: node never processes xacts with snap 'from the future'
2) When we choose CSN to commit xact with, we must read clocks on
all the nodes who participated in it and set CSN to max among read
values.

These rules ensure *integrity* of the snapshot in the face of clock
skews regardless of which node we access: that is, snapshots are stable
(no non-repeatable reads) and no xact is considered half committed: they
prevent situation when the snapshot sees some xact on one node as
committed and on another node as still running.
(Actually, this is only true under the assumption that any distributed
xact is committed at all nodes instantly at the same time; this is
obviously not true, see 3rd point below.)

If we are speaking about e.g. traditional usage of hot standy, when
client in one xact accesses either primary, or some standby, but not
several nodes at once, we just don't need this stuff because usual MVCC
in Postgres already provides you consistent snapshot. Same is true for
multimaster-like setups, when each node accepts writes, but client still
accesses single node; if there is a write conflict (e.g. same row
updated on different nodes), one of xacts must be aborted; the snapshot
is still good.

However, if you really intend to read in one xact data from multiple
nodes (e.g. read primary and then replica), then yes, these problems
arise and Clock-SI helps with them. However, honestly it is hard for me
to make up a reason why would you want to do that: reading local data is
always more efficient than visiting several nodes. It would make sense
if we could read primary and replica in parallel, but that currently is
impossible in core Postgres. More straightforward application of the
patchset is sharding, when data is splitted and you might need to go to
several nodes in one xact to collect needed data.

Also, this patchset adds core algorithm and makes use of it only in
postgres_fdw; you would need to adapt replication (import/export global
snapshot API) to make it work there.

3) The third rule of Clock-SI deals with the following problem.
Distributed (writing to several nodes) xact doesn't commit (i.e.
becomes visible) instantly at all nodes. That means that there is a
time hole in which we can see xact as committed on some node and still
running on another. To mitigate this, Clock-SI adds kind of two-phase
commit on visibility: additional state InDoubt which blocks all
attempts to read this xact changes until xact's fate (commit/abort) is
determined.

Unlike the previous problem, this issue exists in all replicated
setups. Even if we just have primary streaming data to one hot standby,
xacts are not committed on them instantly and we might observe xact as
committed on primary, then quickly switch to standby and find the data
we have just seen disappeared. remote_apply mode partially alleviates
this problem (apparently to the degree comfortable for most application
developers) by switching positions: with it xact always commits on
replicas earlier than on master. At least this guarantees that the guy
who wrote the xact will definitely see it on replica unless it drops the
connection to master before commit ack. Still the problem is not fully
solved: only addition of InDoubt state can fix this.

While Clock-SI (and this patchset) certainly addresses the issue as it
becomes even more serious in sharded setups (it makes possible to see
/parts/ of transactions), there is nothing CSN or clock specific
here. In theory, we could implement the same two-phase commit on
visiblity without switching to timestamp-based CSN MVCC.

Aside from the paper, you can have a look at Clock-SI explanation in
these slides [1] from PGCon.

> 2. How does InDoubt transaction behave when we add or subtract leap seconds?

Good question! In Clock-SI, time can be arbitrary desynchronized and
might go forward with arbitrary speed (e.g. clocks can be stopped), but
it must never go backwards. So if leap second correction is implemented
by doubling the duration of certain second (as it usually seems to be),
we are fine.

> Also, I could not understand some notes from Arseny:
>
>> 25 июля 2018 г., в 16:35, Arseny Sher <a(dot)sher(at)postgrespro(dot)ru> написал(а):
>>
>> * One drawback of these patches is that only REPEATABLE READ is
>> supported. For READ COMMITTED, we must export every new snapshot
>> generated on coordinator to all nodes, which is fairly easy to
>> do. SERIALIZABLE will definitely require chattering between nodes,
>> but that's much less demanded isolevel (e.g. we still don't support
>> it on replicas).
>
> If all shards are executing transaction in SERIALIZABLE, what anomalies does it permit?
>
> If you have transactions on server A and server B, there are
> transactions 1 and 2, transaction A1 is serialized before A2, but B1
> is after B2, right?
>
> Maybe we can somehow abort 1 or 2?

Yes, your explanation is concise and correct.
To put it in another way, ensuring SERIALIZABLE in MVCC requires
tracking reads, and there is no infrastructure for doing it
globally. Classical write skew is possible: you have node A holding x
and node B holding y, initially x = y = 30 and there is a constraint x +
y > 0.
Two concurrent xacts start:
T1: x = x - 42;
T2: y = y - 42;
They don't see each other, so both commit successfully and the
constraint is violated. We need to transfer info about reads between
nodes to know when we need to abort someone.

>>
>> * Another somewhat serious issue is that there is a risk of recency
>> guarantee violation. If client starts transaction at node with
>> lagging clocks, its snapshot might not include some recently
>> committed transactions; if client works with different nodes, she
>> might not even see her own changes. CockroachDB describes at [1] how
>> they and Google Spanner overcome this problem. In short, both set
>> hard limit on maximum allowed clock skew. Spanner uses atomic
>> clocks, so this skew is small and they just wait it at the end of
>> each transaction before acknowledging the client. In CockroachDB, if
>> tuple is not visible but we are unsure whether it is truly invisible
>> or it's just the skew (the difference between snapshot and tuple's
>> csn is less than the skew), transaction is restarted with advanced
>> snapshot. This process is not infinite because the upper border
>> (initial snapshot + max skew) stays the same; this is correct as we
>> just want to ensure that our xact sees all the committed ones before
>> it started. We can implement the same thing.
> I think that this situation is also covered in Clock-SI since
> transactions will not exit InDoubt state before we can see them. But
> I'm not sure, chances are that I get something wrong, I'll think more
> about it. I'd be happy to hear comments from Stas about this.

InDoubt state protects from seeing xact who is not yet committed
everywhere, but it doesn't protect from starting xact on node with
lagging clocks, obtaining plainly old snapshot. We won't see any
half-committed data (InDoubt covers us here) with it, but some recently
committed xacts might not get into our old snapshot entirely.

>> * 003_bank_shared.pl test is removed. In current shape (loading one
>> node) it is useless, and if we bombard both nodes, deadlock surely
>> appears. In general, global snaphots are not needed for such
>> multimaster-like setup -- either there are no conflicts and we are
>> fine, or there is a conflict, in which case we get a deadlock.
> Can we do something with this deadlock? Will placing an upper limit on
> time of InDoubt state fix the issue? I understand that aborting
> automatically is kind of dangerous...

Sure, this is just a generalization of basic deadlock problem to
distributed system. To deal with it, someone must periodically collect
locks across the nodes, build graph, check for loops and punish (abort)
one of its chain creators, if loop exists. InDoubt (and global snapshots
in general) is unrelated to this, we hanged on usual row-level lock in
this test. BTW, our pg_shardman extension has primitive deadlock
detector [2], I suppose Citus [3] also has one.

> Also, currently hanging 2pc transaction can cause a lot of headache
> for DBA. Can we have some kind of protection for the case when one
> node is gone permanently during transaction?

Oh, subject of automatic 2PC xacts resolution is also matter of another
(probably many-miles) thread, largely unrelated to global
snapshots/visibility. In general, the problem of distributed transaction
commit which doesn't block while majority of nodes is alive requires
implementing distributed consensus algorithm like Paxos or Raft. You
might also find thread [4] interesting.

Thank you for your interest in the topic!

[1] https://yadi.sk/i/qgmFeICvuRwYNA
[2] https://github.com/postgrespro/pg_shardman/blob/broadcast/pg_shardman--0.0.3.sql#L2497
[3] https://www.citusdata.com/
[4] https://www.postgresql.org/message-id/flat/CAFjFpRc5Eo%3DGqgQBa1F%2B_VQ-q_76B-d5-Pt0DWANT2QS24WE7w%40mail.gmail.com

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Re: Global snapshots at 2018-09-24 08:58:48 from Andrey Borodin

Responses

Re: Global snapshots at 2018-11-29 15:21:21 from Dmitry Dolgov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2018-09-26 15:09:59	Re: Allowing printf("%m") only where it actually works
Previous Message	Stephen Frost	2018-09-26 14:54:45	Re: Online verification of checksums