Re: eXtensible Transaction Manager API (v2)

From: Kevin Grittner <kgrittn(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: eXtensible Transaction Manager API (v2)
Date: 2016-03-14 16:53:15
Message-ID: CACjxUsPwv70_OVM_58BDsFM=5GGeV3ooTsVgvzbY_GbHdw4HbA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 12, 2016 at 11:21 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> I'd also be interested in hearing Kevin Grittner's thoughts about
> serializability in a distributed environment, since he's obviously
> thought about the topic of serializability quite a bit.

I haven't done a thorough search of the academic literature on
this, and I wouldn't be comfortable taking a really solid position
without that; but in thinking about it so far, it seems there are
at least three distinct problems, which may have distinct solutions.

*Physical replication* may be best handled by leveraging the "safe
snapshot" idea already implemented in READ ONLY DEFERRABLE
transactions, and passing through information in the WAL stream to
allow the receiver to identify points where a snapshot can be taken
which cannot see an anomaly. There should probably be an option to
use the last known safe snapshot or wait for a point in the stream
where one next appears. This might take as little as a bit or two
per WAL commit record. It's not clear what the processing overhead
would be -- it wouldn't surprise me if it was "in the noise", nor
would it surprise me if it wasn't. We would need some careful
benchmarking and, if performance were an issue, a GUC to control
whether the information was passed along (and, thus, whether
SERIALIZABLE transactions were allowed on the replica).
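
To make that concrete, here is a rough sketch of what the
replica-side bookkeeping might look like.  Everything in it -- the
flag bit, the type and function names, the policy setting -- is
hypothetical, just to show the shape of the idea:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical flag bit carried on each WAL commit record. */
    #define XACT_FLAG_SAFE_SNAPSHOT_POINT  0x01

    typedef enum
    {
        SAFE_SNAPSHOT_USE_LAST,  /* take the last known safe snapshot */
        SAFE_SNAPSHOT_WAIT_NEXT  /* wait for the next safe point */
    } SafeSnapshotPolicy;

    typedef struct
    {
        uint64_t last_safe_lsn;  /* latest WAL position known to be safe */
        bool     have_safe_lsn;
    } ReplicaSnapshotState;

    /* Called for every commit record replayed on the standby. */
    static void
    note_commit_record(ReplicaSnapshotState *state, uint64_t lsn,
                       uint8_t flags)
    {
        if (flags & XACT_FLAG_SAFE_SNAPSHOT_POINT)
        {
            state->last_safe_lsn = lsn;
            state->have_safe_lsn = true;
        }
    }

    /*
     * Decide where a new SERIALIZABLE snapshot on the replica may be
     * taken.  Returns true and sets *snapshot_lsn if a snapshot can
     * be taken now; otherwise the caller must wait for the next
     * flagged commit record.
     */
    static bool
    choose_serializable_snapshot(const ReplicaSnapshotState *state,
                                 SafeSnapshotPolicy policy,
                                 uint64_t *snapshot_lsn)
    {
        if (policy == SAFE_SNAPSHOT_USE_LAST && state->have_safe_lsn)
        {
            *snapshot_lsn = state->last_safe_lsn;
            return true;
        }
        return false;
    }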

*Logical replication* (considered for the moment in a
unidirectional context) might best be handled by some reordering of
application of the commits on the replica into "apparent order of
execution" -- which is pretty well defined on the primary based on
commit order adjusted by read-write dependencies. Basically, the
"simple" implementation would be that WAL is applied normally
unless you receive a commit record which is flagged in some way to
indicate that it is for a serializable transaction which wrote data
and at the time of commit was concurrent with at least one other
serializable transaction which had not completed and was not READ
ONLY. Such a commit would be held until the WAL stream indicated
that all of those concurrent transactions had completed, and
whether any of them had a read-write dependency *in* to the
transaction with the suspended commit; commits for any
transactions with such a dependency must be applied ahead of the
suspended commit. This
would allow logical replication, with all the filtering and such,
to avoid ever showing a state on the replica which contained
serialization anomalies.
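
As a very rough sketch of that apply-side bookkeeping (the commit
flag, the resolution message, and all of the names here are
hypothetical -- nothing like this exists in the WAL format today):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t Xid;

    /*
     * A commit whose application is suspended until the concurrent
     * serializable read-write transactions open at its commit time
     * have been resolved.
     */
    typedef struct SuspendedCommit
    {
        Xid  xid;
        Xid *waiting_for;   /* concurrent serializable xids still open */
        int  nwaiting;
    } SuspendedCommit;

    /*
     * Hypothetical resolution message derived from the WAL stream: a
     * concurrent transaction has finished, and we are told whether it
     * has a read-write dependency *in* to the suspended transaction.
     * If it does, its commit must be applied ahead of the suspended
     * one; if not, it is applied in normal stream order elsewhere.
     */
    static void
    resolve_concurrent_xact(SuspendedCommit *sc, Xid finished,
                            bool rw_dependency_in,
                            void (*apply_commit) (Xid xid))
    {
        for (int i = 0; i < sc->nwaiting; i++)
        {
            if (sc->waiting_for[i] != finished)
                continue;

            if (rw_dependency_in)
                apply_commit(finished);  /* must precede sc->xid */

            /* remove it from the wait list */
            sc->waiting_for[i] = sc->waiting_for[--sc->nwaiting];

            if (sc->nwaiting == 0)
                apply_commit(sc->xid);   /* safe to apply now */
            return;
        }
    }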

*Logical replication with cycles* (where there is some path for
cluster A to replicate to cluster B, and some other path for
cluster B to replicate the same or related data to cluster A) has a
few options. You could opt for "eventual consistency" --
essentially giving up on the I in ACID and managing the anomalies.
In practice this seems to lead to some form of S2PL at the
application coding level, which is very bad for performance and
concurrency, so I tend to think it should not be the only option.
Built-in S2PL would probably perform better than having it pasted
on at the application layer through some locking API, but for most
workloads it would still be inferior to SSI in both concurrency and
performance. Unless a search of the literature turns up some new
alternative, I'm inclined to think that if you want to distribute a
"logical" database over multiple clusters and still manage race
conditions through use of SERIALIZABLE transactions, a distributed
SSI implementation may be the best bet. That requires the
transaction manager (or something like it) to track non-blocking
predicate "locks" (what the implementation calls a SIReadLock)
across the whole environment, as well as tracking rw-conflicts (our
short name for read-write dependencies) across the whole
environment. Since SSI also looks at the MVCC state, checks of
that state would also need to be made somehow without falling
victim to race conditions.
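
As a very rough sketch, the per-transaction bookkeeping such a
transaction manager might need could look something like this (all
names hypothetical; the hard part is gathering and updating this
information consistently from every node, not the check itself):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct DistributedXact
    {
        uint64_t global_xid;   /* cluster-wide transaction identity */
        int      origin_node;  /* node where the transaction runs */
        bool     has_rw_conflict_in;   /* a concurrent xact read what
                                        * we wrote */
        bool     has_rw_conflict_out;  /* we read what a concurrent
                                        * xact wrote */
    } DistributedXact;

    /*
     * Record an rw-conflict reported by some node: "reader" read a
     * version that "writer" replaced.  In a real system this update
     * would have to be ordered consistently with each node's MVCC
     * state to avoid races.
     */
    static void
    record_rw_conflict(DistributedXact *reader, DistributedXact *writer)
    {
        reader->has_rw_conflict_out = true;
        writer->has_rw_conflict_in = true;
    }

    /*
     * The core SSI "dangerous structure" test, applied with conflict
     * information gathered from every node rather than from one
     * cluster's shared memory: a pivot with both an rw-conflict in
     * and an rw-conflict out means one of the transactions involved
     * must be cancelled to avoid a possible serialization anomaly.
     */
    static bool
    is_pivot(const DistributedXact *xact)
    {
        return xact->has_rw_conflict_in && xact->has_rw_conflict_out;
    }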

If I remember correctly, the patches to add the SSI implementation
of SERIALIZABLE transactions were about ten times the size of the
patches to remove S2PL and initially replace it with MVCC. I don't
have even a gut feel as to how much bigger the distributed form is
likely to be. On the one hand the *fundamental logic* is all there
and should not need to change; on the other hand the *mechanism*
for acquiring the data to be *used* in that logic would be
different and potentially complex.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
