Re: Logical replication and multimaster

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication and multimaster
Date: 2015-12-03 14:52:15
Message-ID: CAMsr+YE9xgD_LoOm_LmSs9_MiuLgOay=LziWLFvGNN6xfKB-sA@mail.gmail.com

On 3 December 2015 at 20:39, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On 30 November 2015 at 17:20, Konstantin Knizhnik <
> k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
>
>
>> But it looks like there is not much sense in having multiple network
>> connections between one pair of nodes.
>> It seems better to have one connection between nodes but provide
>> parallel execution of the received transactions on the destination
>> side. That also seems nontrivial, though. PostgreSQL now has some
>> infrastructure for background workers, but there is still no
>> worker-pool and job-queue abstraction providing a simple way to run
>> jobs in parallel. I wonder if somebody is already working on this, or
>> whether we should try to propose our own solution?
>>
>
> There are definitely two clear places where additional help would be
> useful and welcome right now.
>

Three, IMO: a re-usable, generic bgworker pool driven by shmem messaging
would also be quite handy. We'll want something like that when we have
transaction interleaving.

I think Konstantin's design is a bit restrictive at the moment: at the
least it needs to address sticky dispatch, and it almost certainly needs
to use dynamic bgworkers (and maybe dynamic shmem too) to be flexible.
Some thought will also be needed to make sure it doesn't rely on
!EXEC_BACKEND behaviour, such as passing pointers to postmaster memory
inherited across fork(). But the general idea sounds really useful, and
we'll need either that or async libpq for concurrent apply.
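
As a strawman, here's roughly what launching such pool workers with the
dynamic bgworker API could look like. The module and function names are
made up for illustration; the point is resolving the entry point by
library/function name rather than by a function pointer, which is what
keeps it EXEC_BACKEND-safe:

#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"

/* Launch one worker for the (hypothetical) "worker_pool" module. */
static bool
launch_pool_worker(int slot)
{
    BackgroundWorker        worker;
    BackgroundWorkerHandle *handle;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
        BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_restart_time = BGW_NEVER_RESTART;

    /*
     * Name the entry point instead of passing a function pointer: under
     * EXEC_BACKEND the worker doesn't inherit our address space via
     * fork(), so pointers into this backend's memory are useless there.
     */
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "worker_pool");
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "pool_worker_main");
    snprintf(worker.bgw_name, BGW_MAXLEN, "pool worker %d", slot);

    /* Only a flat Datum may be passed; larger state belongs in
     * (dynamic) shared memory. */
    worker.bgw_main_arg = Int32GetDatum(slot);
    worker.bgw_notify_pid = MyProcPid;

    return RegisterDynamicBackgroundWorker(&worker, &handle);
}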

> 1. Allowing logical decoding to have a "speculative pre-commit data"
> option, making some data available via the decoding API so that it can
> be transferred prior to commit.
>

Petr, Andres and I tended to refer to that as interleaved transaction
streaming. The idea is to send changes from multiple xacts mixed together
in the stream, identified by an xid sent with each message, as we decode
them from WAL. Currently we add them to a local reorder buffer and send
them only in commit order after commit.

This moves responsibility for xact ordering (and buffering, if necessary)
to the downstream. It introduces the possibility that concurrently replayed
xacts could deadlock with each other and a few exciting things like that,
too, but with the payoff that we can continue to apply small transactions
in a timely manner even as we're streaming a big transaction like a COPY.
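
To make the framing concrete, something along these lines; this is a
hypothetical sketch, not an existing wire format:

#include "postgres.h"

/*
 * Hypothetical interleaved-stream framing.  Every message carries the
 * toplevel xid, so the downstream can route changes into per-xact
 * buffers or apply sessions as they arrive, rather than the upstream
 * reorder buffer holding everything until commit.
 */
typedef enum StreamMsgType
{
    STREAM_MSG_BEGIN = 'B',     /* first message seen for this xid */
    STREAM_MSG_CHANGE = 'W',    /* row change payload follows */
    STREAM_MSG_COMMIT = 'C',    /* committed; downstream may apply/flush */
    STREAM_MSG_ABORT = 'A'      /* rolled back; downstream discards */
} StreamMsgType;

typedef struct StreamMsgHeader
{
    uint8       msgtype;        /* one of StreamMsgType */
    uint32      xid;            /* toplevel xid this message belongs to */
    uint64      lsn;            /* WAL position of the decoded record */
    uint32      payload_len;    /* length of row data after the header */
} StreamMsgHeader;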

We could enable interleaving either right from the start of the xact or
only once it crosses a certain size threshold. For your purposes,
Konstantin, you'd want to do it right from the start, since latency is
crucial for you. For pglogical we'd probably want to buffer transactions
a bit and only start streaming once they get big.

> This would allow us to reduce the delay that occurs at commit, especially
> for larger transactions, or where there are very low latency requirements
> for smaller transactions. Some heuristic or user interface would be
> required to decide whether, and for which transactions, to make data
> available prior to commit.
>

I imagine we'd have a knob, either global or per-slot, that sets a
threshold based on the size in bytes of the buffered xact, with 0 meaning
"start streaming immediately".

> And we would need to send abort messages should the transactions not
> commit as expected. That would be a patch on logical decoding and is an
> essentially separate feature to anything currently being developed.
>

I agree that this is strongly desirable. It'd benefit anyone using logical
decoding and would have wide applications.
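
Concretely, I'd expect the output plugin API to grow an abort callback
next to begin/change/commit. A hypothetical sketch, not the 9.5 API (and
the struct below is abbreviated):

/*
 * Hypothetical: called when a transaction whose changes have already
 * been streamed turns out to have rolled back.  Today this can't
 * happen, since we only send changes after commit.
 */
typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
                                      ReorderBufferTXN *txn,
                                      XLogRecPtr abort_lsn);

typedef struct OutputPluginCallbacks
{
    LogicalDecodeStartupCB startup_cb;
    LogicalDecodeBeginCB begin_cb;
    LogicalDecodeChangeCB change_cb;
    LogicalDecodeCommitCB commit_cb;
    LogicalDecodeAbortCB abort_cb;      /* new: xact did not commit */
    LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;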

> 2. Some mechanism/theory to decide when/if to allow parallel apply.
>

I'm not sure it's so much about whether to allow it as about how to do it.

> We already have working multi-master that has been contributed to PGDG, so
> contributing that won't gain us anything.
>

Namely BDR.

> There is a lot of code and pglogical is the most useful piece of code to
> be carved off and reworked for submission.
>

Starting with the already-published output plugin, with the downstream to
come around the release of 9.5.

> Having a single network connection between nodes would increase efficiency
> but also increase replication latency, so it's not useful in all cases.
>

If we interleave messages I'm not sure that's too big a problem. Latency
would only become an issue there if a single big row (big Datum contents)
causes lots of small work to get stuck behind it.

IMO this is a separate issue to be dealt with later.

> I think having some kind of message queue between nodes would also help,
> since there are many cases for which we want to transfer data, not just a
> replication data flow. For example, consensus on DDL, or MPP query traffic.
> But that is open to wider debate.
>

Logical decoding doesn't really define any network protocol at all. It's
very flexible, and we can throw almost whatever we want down it. The
pglogical_output protocol is extensible enough that we can just add
additional messages when we need to, making them opt-in so we don't break
clients that don't understand them.
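
Opt-in can ride on the options the client already passes when it starts
replication; a sketch, with a made-up option name:

#include "postgres.h"
#include "commands/defrem.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

typedef struct DecodingData
{
    bool        want_sequence_messages; /* hypothetical opt-in flag */
} DecodingData;

/* Startup callback: parse plugin options supplied by the client.
 * Clients that never send the option never see the new message type. */
static void
decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
               bool is_init)
{
    DecodingData *data = palloc0(sizeof(DecodingData));
    ListCell   *lc;

    ctx->output_plugin_private = data;

    foreach(lc, ctx->output_plugin_options)
    {
        DefElem    *elem = lfirst(lc);

        if (strcmp(elem->defname, "want_sequence_messages") == 0)
            data->want_sequence_messages = defGetBoolean(elem);
    }
}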

I'm likely to need to do that soon for sequence-advance messages if I can
get logical decoding of sequence advance working.

We might want a way to queue those messages at a particular LSN, so we
can use them for replay barriers etc. and ensure they're crash-safe, like
the generic WAL messages used in BDR and proposed for core. Is that what
you're getting at? WAL messages would certainly be nice, but I think we
can mostly, if not entirely, avoid the need for them if we have
transaction interleaving and concurrent transaction support.
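
For reference, the shape of that API, as used in BDR and proposed for
core (treat the exact signature as illustrative; it isn't in 9.5):

#include "postgres.h"
#include "replication/message.h"

/*
 * WAL-log an out-of-band message at the current insert position.
 * Decoding replays it at that LSN relative to the surrounding changes,
 * which is what makes it usable as a replay barrier, and WAL-logging
 * makes it crash-safe.  The prefix namespaces consumers; "myext_ddl"
 * is made up.
 */
static XLogRecPtr
queue_ddl_barrier(const char *ddl_command)
{
    return LogLogicalMessage("myext_ddl",
                             ddl_command,
                             strlen(ddl_command) + 1,
                             true); /* transactional: decoded at commit */
}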

Somewhat related, I'd quite like to be able to send messages from
downstream back to upstream, where they're passed to a hook on the logical
decoding plugin. That'd eliminate the need to do a whole bunch of stuff
that currently has to be done using direct libpq connections or a second
decoding slot in the other direction. Basically send a CopyData packet in
the other direction and have its payload passed to a new hook on output
plugins.
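
The hook might look something like this; entirely hypothetical, nothing
like it exists in the walsender today:

/*
 * Hypothetical: invoked by the walsender when a CopyData message
 * arrives from the downstream while streaming, handing the payload to
 * the output plugin.
 */
typedef void (*LogicalUpstreamMessageCB) (struct LogicalDecodingContext *ctx,
                                          const char *payload,
                                          Size payload_len);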
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
