Re: Logical replication and multimaster

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Logical replication and multimaster
Date: 2015-12-03 01:09:46
Message-ID: CAMsr+YH+3Ts=5jXbDAUOspntqrLHFtG5y=L15KL8sdJyRck7Kw@mail.gmail.com

On 1 December 2015 at 00:20, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
wrote:

> We have implemented an ACID multimaster based on logical replication and
> our DTM (distributed transaction manager) plugin.

What are you using for an output plugin and for replay?

I'd really like to collaborate using pglogical_output if at all possible.
Petr's working really hard to get the pglogical downstream out too, with me
helping where I can.

I'd hate for us to waste time and effort working in parallel on overlapping
functionality. I did a LOT of work to make pglogical_output extensible and
reusable for different needs, with hooks used heavily instead of making
things specific to the pglogical downstream: a protocol documented in
detail, a JSON output mode as an option, parameters for clients to
negotiate options, and so on.

Would a different name for the upstream output plugin help?

> And according to 2ndQuadrant's results, BDR performance is very close to
> hot standby.

Yes... but it's asynchronous multi-master. Very different to what you're
doing.

> I wonder if it is a fundamental limitation of the logical replication
> approach, making it efficient only for asynchronous replication, or
> whether it can be somehow tuned/extended to efficiently support
> synchronous replication?

I'm certain there are improvements to be made for synchronous replication.

> We have also considered alternative approaches:
> 1. Statement-based replication.

Just don't go there. Really.

> It seems to be better to have one connection between nodes, but provide
> parallel execution of received transactions at the destination side.

I agree. This is something I'd like to be able to do through logical
decoding. As far as I can tell there's no fundamental barrier to doing so,
though there are a few limitations when streaming logical xacts:

- We can't avoid sending transactions that get rolled back.

- We can't send the commit timestamp, commit LSN, etc. at BEGIN time, so
  last-update-wins conflict resolution can't be done based on commit
  timestamp.

- When streaming, the xid must be in each message, not just in
  begin/commit.

- The apply process can't use the SPI to apply changes directly, since a
  single backend can't multiplex transactions. It'll need to use shmem to
  communicate with a pool of workers, dispatching messages to workers as
  they arrive. Or it can multiplex a set of libpq connections in async
  mode, which I suspect may prove to be better; a rough sketch of that
  approach follows.
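
To make the multiplexed-libpq idea concrete, here's a minimal sketch of
driving several connections in non-blocking mode from one apply process.
It's an illustration, not pglogical code: the connection string, the fixed
pool size, and the pg_sleep() stand-in for real apply statements are all
placeholder assumptions.

    /*
     * Sketch: multiplex several libpq connections in non-blocking mode
     * from a single process, so no one connection blocks the others.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/select.h>
    #include <libpq-fe.h>

    #define NWORKERS 4              /* placeholder pool size */

    int
    main(void)
    {
        PGconn *conns[NWORKERS];
        int     i;

        /* One connection per concurrently-applied upstream xact. */
        for (i = 0; i < NWORKERS; i++)
        {
            conns[i] = PQconnectdb("dbname=target");    /* assumed DSN */
            if (PQstatus(conns[i]) != CONNECTION_OK ||
                PQsetnonblocking(conns[i], 1) != 0)
            {
                fprintf(stderr, "%s", PQerrorMessage(conns[i]));
                exit(1);
            }
        }

        /* Dispatch one statement per connection; a real apply loop
         * would route decoded changes keyed by upstream xid. */
        for (i = 0; i < NWORKERS; i++)
        {
            PQsendQuery(conns[i], "SELECT pg_sleep(1)");
            PQflush(conns[i]);      /* push queued output to the server */
        }

        /* Event loop: wait for readable sockets, consume input, and
         * collect results as each connection finishes. */
        for (;;)
        {
            fd_set  readable;
            int     maxfd = -1, busy = 0;

            FD_ZERO(&readable);
            for (i = 0; i < NWORKERS; i++)
            {
                int sock = PQsocket(conns[i]);

                FD_SET(sock, &readable);
                if (sock > maxfd)
                    maxfd = sock;
            }
            if (select(maxfd + 1, &readable, NULL, NULL, NULL) < 0)
                break;

            for (i = 0; i < NWORKERS; i++)
            {
                if (!FD_ISSET(PQsocket(conns[i]), &readable))
                    continue;
                PQconsumeInput(conns[i]);
                while (!PQisBusy(conns[i]))
                {
                    PGresult *res = PQgetResult(conns[i]);

                    if (res == NULL)
                        break;  /* this connection is idle again */
                    PQclear(res);
                }
            }

            for (i = 0; i < NWORKERS; i++)
                busy += PQisBusy(conns[i]);
            if (!busy)
                break;          /* everything applied */
        }

        for (i = 0; i < NWORKERS; i++)
            PQfinish(conns[i]);
        return 0;
    }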

I've made provision for streaming support in the pglogical_output
extension. It'll need core changes to allow logical decoding to stream
changes, though.

Separately, I'd also like to look at decoding and sending sequence
advances, which happen outside transaction boundaries.

> We now have in PostgreSQL some infrastructure for background workers, but
> there is still no abstraction of a worker pool and job queue that would
> provide a simple way to organize parallel execution of jobs. I wonder if
> somebody is working on this now, or should we try to propose our solution?

I think a worker pool would be quite useful to have.

For BDR and for pglogical we had to build infrastructure on top of static
and dynamic bgworkers. A static worker launches a dynamic bgworker for
each database. The dynamic bgworker for a database looks at
extension-provided user catalogs to determine whether it should launch
more dynamic bgworkers, one per connection to a peer node.

Because the bgworker argument is a single by-value Datum, the argument
passed is an index into a static shmem array of structs. Each struct is
populated with the target database oid (or name on 9.4, due to bgworker
API limitations) and other info needed to start the worker.
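
For illustration, the launch side looks roughly like the sketch below on
the 9.5-era bgworker API. MyWorkerSlot, worker_slots, my_extension and
my_apply_main are made-up names standing in for the real code:

    #include "postgres.h"
    #include "miscadmin.h"
    #include "postmaster/bgworker.h"

    /* Hypothetical per-worker struct in a static shmem array. */
    typedef struct MyWorkerSlot
    {
        Oid     dboid;              /* target database */
        /* ... peer connection info, etc ... */
    } MyWorkerSlot;

    static MyWorkerSlot *worker_slots;  /* set up at shmem init time */

    static void
    launch_worker_for_slot(int slot, Oid dboid)
    {
        BackgroundWorker        worker;
        BackgroundWorkerHandle *handle;

        /* Stash everything the worker needs in its shmem slot. */
        worker_slots[slot].dboid = dboid;

        memset(&worker, 0, sizeof(worker));
        snprintf(worker.bgw_name, BGW_MAXLEN, "apply worker %d", slot);
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
                           BGWORKER_BACKEND_DATABASE_CONNECTION;
        worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
        worker.bgw_restart_time = BGW_NEVER_RESTART;    /* see below */
        snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_extension");
        snprintf(worker.bgw_function_name, BGW_MAXLEN, "my_apply_main");
        /* The only argument is a by-value Datum: pass the slot index. */
        worker.bgw_main_arg = Int32GetDatum(slot);
        worker.bgw_notify_pid = MyProcPid;

        if (!RegisterDynamicBackgroundWorker(&worker, &handle))
            elog(WARNING, "could not register worker for slot %d", slot);
    }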

Because registered static and dynamic bgworkers get restarted by the
postmaster after a crash/restart cycle, and the restarted static worker
will register new dynamic workers after restart, we have to jump through
some annoying hoops to avoid duplicate bgworkers. A generation counter is
stored in postmaster memory, incremented on crash recovery, and copied to
shmem. The high bits of the Datum argument to each worker embed the
generation counter. Workers compare their argument's counter to the one
in shmem and exit if it differs, so the relaunched old generation of
workers exits after a crash/restart cycle. See the thread on
BGW_NO_RESTART_ON_CRASH for details.
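
The worker-side check amounts to something like this sketch; the bit
layout (16 generation bits above a 16-bit slot index) and the shmem
struct name are illustrative assumptions, not the actual BDR code:

    #include "postgres.h"
    #include "storage/ipc.h"        /* proc_exit() */

    /* Illustrative shmem state; real code attaches via ShmemInitStruct.
     * The counter is incremented in postmaster memory on crash recovery,
     * then copied here. */
    typedef struct MyApplyShmem
    {
        uint16  current_generation;
    } MyApplyShmem;

    static MyApplyShmem *my_shmem;

    void
    my_apply_main(Datum main_arg)
    {
        uint32  arg = DatumGetUInt32(main_arg);
        uint16  generation = arg >> 16;
        uint16  slot = arg & 0xffff;

        /*
         * After a crash/restart cycle the postmaster relaunches us with
         * the argument we were originally registered with, but the
         * static worker will also register a fresh generation. If our
         * embedded generation is stale, exit quietly and let the new
         * generation run.
         */
        if (generation != my_shmem->current_generation)
            proc_exit(0);

        (void) slot;    /* real code: attach to worker_slots[slot] */

        /* ... normal startup: connect to the database, start apply ... */
    }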

In pglogical we're instead using BGW_NEVER_RESTART workers and doing
restarts ourselves when needed, ignoring the postmaster's ability to
restart bgworkers when the worker crashes.

It's likely that most projects using bgworkers for this sort of thing will
need similar functionality, so generalizing it into a worker pool API makes
a lot of sense. In the process we could really use an API to examine
currently registered and running bgworkers. Interested in collaborating on
that?

Another thing I've wanted as part of this work is a way to get a one-time
authentication cookie from the server that can be passed as a libpq
connection option to get a connection without having to know a password or
otherwise mess with pg_hba.conf. Basically a way to say "I'm a bgworker
running with superuser rights within Pg, and I want to make a libpq
connection to this database. I'm inherently trusted, so don't mess with
pg_hba.conf and passwords, just let me in".

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
