Re: Global snapshots

From: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Global snapshots
Date: 2018-05-16 12:02:02
Message-ID: 26E16795-5EE2-4BF0-A23A-C3E827959541@postgrespro.ru
Lists: pgsql-hackers

> On 15 May 2018, at 15:53, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Actually, I think if we're going to pursue that approach, we ought to
> back off a bit from thinking about global snapshots and think about
> what kind of general mechanism we want. For example, maybe you can
> imagine it like a message bus, where there are a bunch of named
> channels on which the server publishes messages and you can listen to
> the ones you care about. There could, for example, be a channel that
> publishes the new system-wide globalxmin every time it changes, and
> another channel that publishes the wait graph every time the deadlock
> detector runs, and so on. In fact, perhaps we should consider
> implementing it using the existing LISTEN/NOTIFY framework: have a
> bunch of channels that are predefined by PostgreSQL itself, and set
> things up so that the server automatically begins publishing to those
> channels as soon as anybody starts listening to them. I have to
> imagine that if we had a good mechanism for this, we'd get all sorts
> of proposals for things to publish. As long as they don't impose
> overhead when nobody's listening, we should be able to be fairly
> accommodating of such requests.
>
> Or maybe that model is too limiting, either because we don't want to
> broadcast to everyone but rather send specific messages to specific
> connections, or else because we need a request-and-response mechanism
> rather than what is in some sense a one-way communication channel.
> Regardless, we should start by coming up with the right model for the
> protocol first, bearing in mind how it's going to be used and other
> things for which somebody might want to use it (deadlock detection,
> failover, leader election), and then implement whatever we need for
> global snapshots on top of it. I don't think that writing the code
> here is going to be hugely difficult, but coming up with a good design
> is going to require some thought and discussion.

Well, it would be cool to have some general mechanism to unreliably send
messages between postgres instances. I was thinking about the same thing,
mostly in the context of our multimaster, where we have an arbiter bgworker
that collects 2PC responses and heartbeats from other nodes on a separate
TCP port. It used to have some logic inside, but evolved into just sending
messages from a shared-memory out queue and waking backends upon message
arrival. But the need to manage a second port is painful and error-prone, at
least from a configuration point of view. So it would be nice to have a more
general mechanism to exchange messages via the postgres port, ideally with an
interface like shm_mq: send messages into one queue, subscribe to responses
in another. Among the other things that were mentioned (xmin, deadlock,
elections/heartbeats) I am especially interested in some multiplexing for
postgres_fdw, to save on context switches of individual backends while
sending statements.

Talking about the model, I think it would be cool to have some primitives
like the ones provided by ZeroMQ (message push/subscribe/pop) and then
implement more complex ones, like scatter/gather, on top of them.
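
As a rough illustration of that layering (not the multimaster arbiter and not
any existing PostgreSQL or ZeroMQ API), here is a minimal in-process sketch.
Each node has one inbound queue, push and pop are the only primitives, and
scatter_gather is built purely on top of them. All names (Node, push, pop,
scatter_gather) are hypothetical, and subscribe is left out for brevity.

```python
from collections import deque


class Node:
    """A toy node with a single inbound message queue (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.inbox = deque()

    def push(self, dest, payload):
        # Primitive 1: enqueue a message on another node's inbox.
        dest.inbox.append((self.name, payload))

    def pop(self):
        # Primitive 2: take the oldest message, or None if the inbox is empty.
        return self.inbox.popleft() if self.inbox else None


def scatter_gather(coordinator, workers, request, handler):
    """Send `request` to every worker and collect one reply from each."""
    for w in workers:
        coordinator.push(w, request)                   # scatter
    for w in workers:
        _sender, payload = w.pop()                     # worker gets request
        w.push(coordinator, handler(w.name, payload))  # worker replies
    replies = {}
    msg = coordinator.pop()
    while msg is not None:                             # gather
        sender, payload = msg
        replies[sender] = payload
        msg = coordinator.pop()
    return replies


coordinator = Node("coord")
workers = [Node("n1"), Node("n2"), Node("n3")]
result = scatter_gather(coordinator, workers, "PREPARE",
                        lambda name, req: f"{name}: {req} ok")
print(result)
```

A real implementation would of course run over the postgres port and have to
cope with lost messages; the point here is only that the complex operation
needs nothing beyond the two primitives.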

However, that's probably a topic for a very important, but different thread.
For the needs of global snapshots something less ambitious will be suitable.

> And, for that matter, I think the same thing is true for global
> snapshots. The coding is a lot harder for that than it is for some
> new subprotocol, I'd imagine, but it's still easier than coming up
> with a good design.

Sure. This whole global snapshot thing went through several internal
redesigns before becoming satisfactory from our standpoint. However, nothing
prevents us from further iterations. In this regard, it would also be
interesting to hear comments from the Postgres-XL team -- from my experience
with XL code, these patches in core can help XL drop a lot of
visibility-related ifdefs and seriously offload GTM. But maybe I'm missing
something.

> I guess it seems to me that you
> have some further research to do along the lines you've described:
>
> 1. Can we hold back xmin only when necessary and to the extent
> necessary instead of all the time?
> 2. Can we use something like an STO analog, maybe as an optional
> feature, rather than actually holding back xmin?

Yes, to both questions. I'll implement that and share results.

> And I'd add:
>
> 3. Is there another approach altogether that doesn't rely on holding
> back xmin at all?

And for that question I believe the answer is no. If we want to keep
MVCC-like behaviour where read transactions aren't randomly aborted, we
will need to keep old versions, regardless of whether the transaction is
local or global. And to keep old versions we need to hold back xmin to defuse
HOT, microvacuum, macrovacuum, visibility maps, etc. At some point we can
switch to STO-like behaviour, but that probably should be used as protection
from unusually long transactions rather than as standard behavior.
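
To make that dependency concrete, here is a toy model of the argument, not
PostgreSQL's actual vacuum code. A dead row version can be pruned only if the
transaction that deleted it is older than the oldest xmin held by any open
snapshot. The names RowVersion, oldest_xmin and prunable are assumptions for
illustration; xid wraparound is ignored and every deleter is assumed to have
committed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    value: str
    xmin: int             # xid that created this version
    xmax: Optional[int]   # xid that deleted/superseded it, None if live


def oldest_xmin(snapshot_xmins, next_xid):
    # The horizon is the minimum xmin over all open snapshots (local or
    # global); with no snapshots open it is simply the next xid.
    return min(snapshot_xmins, default=next_xid)


def prunable(version, horizon):
    # A version may be removed only if its deleter is older than the
    # horizon, i.e. no open snapshot can still need this version.
    return version.xmax is not None and version.xmax < horizon


versions = [
    RowVersion("v1", xmin=100, xmax=105),
    RowVersion("v2", xmin=105, xmax=110),
    RowVersion("v3", xmin=110, xmax=None),
]

# A long-running (for example, global) reader holds xmin = 103.
held = oldest_xmin([103, 120], next_xid=125)
# After that reader finishes, only the snapshot with xmin = 120 remains.
released = oldest_xmin([120], next_xid=125)

kept_while_held = [v.value for v in versions if not prunable(v, held)]
kept_after_release = [v.value for v in versions if not prunable(v, released)]
print(kept_while_held, kept_after_release)
```

While the old reader is open nothing can be pruned; once it finishes, only
the live version has to stay.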

> For example, if you constructed the happens-after graph between
> transactions in shared memory, including actions on all nodes, and
> looked for cycles, you could abort transactions that would complete a
> cycle. (We say A happens-after B if A reads or writes data previously
> written by B.) If no cycle exists then all is well.

Well, again, it seems to me that any kind of transaction scheduler that
guarantees that RO transactions will not abort (even if it is a special kind
of RO, like read only deferred) needs to keep old versions.
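
For reference, the cycle check from the quoted proposal could be sketched
roughly as follows. This is a minimal single-process illustration, assuming
the whole graph fits in one dictionary; the class and method names are
hypothetical, and it says nothing about how edges would be collected across
nodes. It only decides which transaction to abort; it does not remove the
need to keep old versions for readers.

```python
class HappensAfterGraph:
    """Edges a -> b mean "a happens-after b" (a read or wrote b's data)."""

    def __init__(self):
        self.edges = {}

    def _reaches(self, start, target):
        # Iterative depth-first search: is `target` reachable from `start`?
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges.get(node, ()))
        return False

    def add_edge(self, a, b):
        """Record that a happens-after b; return False if a must abort."""
        # Adding a -> b closes a cycle exactly when b already reaches a.
        if self._reaches(b, a):
            return False
        self.edges.setdefault(a, set()).add(b)
        return True


g = HappensAfterGraph()
ok1 = g.add_edge("T2", "T1")   # T2 read data written by T1
ok2 = g.add_edge("T3", "T2")   # T3 read data written by T2
ok3 = g.add_edge("T1", "T3")   # T1 would now depend on T3: cycle
print(ok1, ok2, ok3)
```

Here the third edge is rejected, so T1 is the transaction that would have to
abort.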

Speaking about alternative approaches, a good evaluation of algorithms
can be found in [HARD17]. The Postgres model is close to the MVCC described
in the article, and if we enable STO with a small timeout then it will be
close to the TIMESTAMP algorithm in the article. The results show that both
MVCC and TIMESTAMP are less performant than the CALVIN approach =) But that
one is quite different from what has been done in Postgres (and probably in
all other databases except Calvin/Fauna itself) in the last 20-30 years.

Also, looking through a bunch of articles, I found that one of the first
articles about MVCC [REED78] (I thought the first was [BERN83], but actually
he references a bunch of previous articles and [REED78] is one of them) was
actually about distributed transactions and uses more or less the same
approach, with pseudo-time in their terminology, to order transactions and
assign snapshots.

[HARD17] https://dl.acm.org/citation.cfm?id=3055548
[REED78] https://dl.acm.org/citation.cfm?id=889815
[BERN83] https://dl.acm.org/citation.cfm?id=319998

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
