Re: pglogical - logical replication contrib module

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Steve Singer <steve(at)ssinger(dot)info>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pglogical - logical replication contrib module
Date: 2016-01-26 15:43:18
Message-ID: CAMsr+YGTL6Vrrkidn6EKVTVt0Abofn-q+0SDYzk6aHVM5DmPxw@mail.gmail.com
Lists: pgsql-hackers

On 23 January 2016 at 11:17, Steve Singer <steve(at)ssinger(dot)info> wrote:

>
> 2) Does this patch provide a set of logical replication features that meets
> many popular use-cases?
>
> Below I will review some use-cases and try to assess how pglogical meets
> them.
>
> ** Streaming Postgresql Upgrade
>
> pg_upgrade is great for many situations, but sometimes you don't want an
> in-place upgrade; you want a streaming upgrade instead. Possibly because
> you don't want application downtime, but instead just want to point your
> applications at the upgraded database server in a controlled manner. Other
> times you might want the option of upgrading to a newer version of PG while
> keeping the option of rolling back to the older version if things go badly.
>
> I think pglogical should be able to handle this use case pretty well
> (assuming the source version of PG is actually new enough to include
> pglogical).
>

Yep, it's designed significantly for that case. That's also why support
for 9.4 and 9.5 is maintained as a standalone extension, so you can get
data out of 9.4 and 9.5 easily (and for that matter, upgrade 9.4 to 9.5).
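
Roughly, the flow for the upgrade case looks like this (host and database
names are placeholders, and this is from memory, so check the pglogical docs
for the exact function signatures):

-- On the old server (the provider):
CREATE EXTENSION pglogical;
SELECT pglogical.create_node(
    node_name := 'old_server',
    dsn := 'host=oldhost dbname=app');
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

-- On the new server (the subscriber):
CREATE EXTENSION pglogical;
SELECT pglogical.create_node(
    node_name := 'new_server',
    dsn := 'host=newhost dbname=app');
SELECT pglogical.create_subscription(
    subscription_name := 'upgrade_sub',
    provider_dsn := 'host=oldhost dbname=app',
    synchronize_structure := true,
    synchronize_data := true);

Once the subscriber has caught up you repoint the applications at it.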

> Support for replicating sequences would need to be added before this is as
> smooth as it could be, but once sequence support is added I think this
> would work well.
>

This will unfortunately have to be 9.6 only. We can work around it with
some limitations in a pglogical downstream in older versions, but I really
want to get time to write a v2 of the sequence decoding patch so I can get
that into 9.6.

> ** Query only replicas (with temp tables or additional indexes)
>
> Sometimes you want a replica for long running or heavy queries.
> Requirements for temp tables, additional indexes, or maybe the effect on
> vacuum mean that our existing WAL based replicas are unsuitable.
>
> I think pglogical should be able to handle this use case pretty well with
> the caveat being that your replica is an asynchronous replica and will
> always lag the origin by some amount.
>

You can actually run it as a synchronous replica too, with the usual
limitations that you can have only one synchronous standby at a time, etc.
Or should be able to - I haven't had a chance to write proper tests for
sync rep using pglogical yet.
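
If you want to try it, the normal sync rep machinery should apply - roughly
the below, assuming the apply worker reports the subscription name as its
application_name (I'd want to verify that before relying on it):

-- On the provider; 'my_subscription' is a placeholder for whatever
-- application_name the pglogical apply worker actually reports.
ALTER SYSTEM SET synchronous_standby_names = 'my_subscription';
SELECT pg_reload_conf();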

Performance will currently hurt if you do big xacts. That's why we need
interleaved xact streaming support down the track.

> Pglogical doesn't have any facilities to rename the tables between the
> origin and replica but they could be added later.
>

Yep, we could do that with a hook. You couldn't use initial schema sync if
you did that, of course.

> ** Sharding
>
> Systems that do application level sharding (or even sharding with a fdw)
> often have non-sharded tables that need to be available on all shards for
> relational integrity or joins. Logical replication is one way to make
> sure that the replicated data gets to all the shards. Sharding systems
> also sometimes want
> to take the data from individual shards and replicate it to a
> consolidation server for reporting purposes.
>
> Pglogical seems to meet this use case. I guess you would have a designated
> origin for the shared data/global data that all shards would subscribe to
> with a set containing the designated data. For the consolidation use case
> you would have the consolidation server subscribe to all shards.
>
> I am less clear about how someone would want DDL changes to work for these
> cases. The DDL support in the patch is pretty limited so I am not going to
> think much now about how we would want DDL to work.
>

DDL support is "version 2" material, basically.

9.5 has hooks that allow DDL deparsing to be implemented as an extension.
That extension needs to be finished off (there's some work-in-progress code
floating around from 9.5 dev) and needs to expose an API for other
extensions. Then pglogical can register hooks with the DDL deparse
extension and use that for DDL replication.
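
The event trigger side of that is already usable in 9.5; the deparse
extension builds on something like the sketch below (the table and function
names are made up for illustration - the real work is in turning the
captured commands into something replayable on the other end):

-- Queue table for captured DDL.
CREATE TABLE ddl_capture (command_tag text, object_identity text);

CREATE OR REPLACE FUNCTION capture_ddl() RETURNS event_trigger
LANGUAGE plpgsql AS $$
DECLARE
    cmd record;
BEGIN
    -- pg_event_trigger_ddl_commands() reports each DDL command that ran.
    FOR cmd IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
        INSERT INTO ddl_capture VALUES (cmd.command_tag, cmd.object_identity);
    END LOOP;
END;
$$;

CREATE EVENT TRIGGER capture_ddl_trg ON ddl_command_end
    EXECUTE PROCEDURE capture_ddl();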

As we learned with BDR, though, DDL replication is *hard*.

For one thing, PostgreSQL has global objects like users that we can't
currently capture DDL for, and DDL then creates db-local objects that have
dependencies on them. So you still have to replicate the global objects
manually. I can see some possible solutions for this, but nothing's really
on the horizon.

Additionally, there are some operations that are a bit problematic for
logical replication, full table rewrites being the main one - they clobber
replication origin information, among other issues. We really need a way to
decode

ALTER TABLE blah ADD COLUMN fred integer NOT NULL DEFAULT 42;

as

BEGIN;
ALTER TABLE blah ADD COLUMN fred integer;
ALTER TABLE blah ALTER COLUMN fred SET DEFAULT 42;
UPDATE blah SET fred = 42;
ALTER TABLE blah ALTER COLUMN fred SET NOT NULL;
COMMIT;

which involves some "interesting" co-operation between DDL deparse and
logical replication. The mapping of the decoded full table rewrite to the
underlying table is also a bit awkward; we just get a decode stream for a
synthetic table named "pg_temp_xxxx" where the xxxx is the upstream table's
oid. A nicer API for that would be good.

> ** Schema changes involving rewriting big tables
>
> Sometimes you have a DDL change on a large table that will involve a table
> rewrite and the best way of deploying the change is to make the DDL change
> on a replica, then once it is finished promote the replica to the origin
> in some controlled fashion. This avoids having to lock the table on the
> origin for hours.
>
> pglogical seems to allow minor schema changes on the replica such as
> changing a type but it doesn't seem to allow a DO INSTEAD trigger on the
> replica. I don't think pglogical currently meets this use case
> particularly well.
>

I'm not sure I fully understand that one.

> ** Failover
>
> WAL replication is probably a better choice for someone just looking for
> failover support from replication.

"Physical" replication as I've been trying to call it, since logical rep is
also WAL based.

I agree with you. It very definitely is.

I have a roadmap in mind for logical rep based failover. We need sequence
advance replication (or even better, sequence access methods), an
upstream<->downstream LSN mapping, failover slots, and logical decoding of
logical slots. A few bits and pieces.

> Someone who is looking at pglogical for failover related use cases probably
> has one or more of the other uses cases I mentioned and wants a logical
> node to take over for a failed origin. If a node fails you can take some
> of the remaining subscribers and have them resubscribe to one of the
> remaining nodes but there is no support for a) Figuring out which of the
> remaining nodes is most ahead b) Letting the subscribers figure out which
> updates from the old origin are missing and getting them from a
> surviving node (they can truncate and re-copy the data but that might be
> very expensive)
>

Yep. Failover slots are part of that picture, and the logical decoding of
slot positions + LSN map stuff carries on from it.

> ** Geographically distributed applications
>
> Sometimes people have database in different geographical locations and
> they want to perform queries and writes locally but replicate all the data
> to all the other locations. This is a multi-master eventually consistent
> use case.
>

Yep. That's what BDR aims for, and why the plan in 2ndQ is to rebuild BDR
around pglogical to continue the work of streaming BDR into core. You can
think of pglogical and pglogical_output as _parts of BDR_ that have been
extracted to submit into core; they've just been heavily polished up, made
much more general purpose, and had things that won't work in core yet
removed.

Hopefully we'll have full MM on top in time, but that can't all be done in
one release.

> The lack of sequence support would be an issue for these use cases.

That's why we need sequence access methods. There's a patch for that in the
9.6 CF too.

> I think you can also only configure the cluster as a fully connected
> grid (with forward_origins='none'). In a lot of deployments you would want
> some amount of cascading and structure, which isn't yet supported. I also
> suspect that managing a grid cluster with more than a handful of nodes will
> be unwieldy (particularly compared to some of the eventually consistent
> nosql alternatives)
>

I envision a management layer on top for that, where pglogical forms an
underlying component.
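
For what it's worth, the per-node setup for a small fully connected grid
looks roughly like this (node names and DSNs are placeholders, and I think
the SQL interface spells "don't forward" as an empty forward_origins array
rather than 'none' - worth double-checking):

-- Run on node_a; repeat symmetrically on node_b and node_c.
SELECT pglogical.create_node(
    node_name := 'node_a',
    dsn := 'host=node_a dbname=app');

-- One subscription per peer. An empty forward_origins array means "only
-- send changes that originated on that peer", which avoids loops.
SELECT pglogical.create_subscription(
    subscription_name := 'from_node_b',
    provider_dsn := 'host=node_b dbname=app',
    forward_origins := '{}');
SELECT pglogical.create_subscription(
    subscription_name := 'from_node_c',
    provider_dsn := 'host=node_c dbname=app',
    forward_origins := '{}');

With more than a handful of nodes that per-node boilerplate is exactly why a
management layer on top makes sense.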

> The features BDR has that were removed for pglogical are probably really
> useful for this use-case (which I think was the original BDR use-case)
>

Yep. They were removed mainly because they can't work with core until some
other patches get in too. Also just to keep the first pglogical submission
vaguely practical and manageable.

> ** Scaling across multiple machines
>
> Sometimes people ask for replication systems that let them support more
> load than a single database server can handle, but with consistency. The
> other use-case applies if you want 'eventually consistent'; this use case
> is for situations where you want something other than eventual consistency.
>
> I don't think pglogical is intended to address this.
>

Correct. That's more like postgres-XL, where you have a distributed lock
manager, distributed transaction manager, etc.

pglogical (or the output plugin at least) can form part of such a solution,
and there's an experiment being contemplated right now to use pglogical as
the data replication transport in postgres-XL. But it doesn't attempt to
provide a whole solution there, only one component.

> Metadata is not transferred between nodes. What I mean by this is that
> nodes don't have a global view of the cluster; they know about their own
> subscriptions but nothing else. This is different from a system like Slony,
> where sl_node and sl_subscription contain a global view of your cluster
> state. Not sending metadata to all nodes in the cluster simplifies a bunch
> of things (you don't have to worry about sending metadata around or whether
> a given piece of metadata is stale), but the downside is that I think the
> tooling to perform a lot of cluster reconfigure operations will need to be
> a lot smarter.
>

Yep. We're going to need a management layer on top for building and
monitoring non-trivial node graphs. Whether in core or not.

Petr and I found that trying to design a schema that could fit all use
cases while preserving a system-wide view of the node graph was
impractical, if not outright impossible. There are quite conflicting use
cases: mesh multi-master wants to see everything, whereas if you have three
upstreams feeding into a data aggregator that then replicates to other
nodes you don't particularly want the leaf nodes worrying about the
upstream origin servers.

> Petr and Craig, have you thought about how you might support getting the
> cluster back into a sane state after a node fails, with minimal pain?
>

Yes.

There are really two approaches. One is having a physical standby, where you
fail over to a streaming physical replica and the state of your logical
slots is preserved. For that we need failover slots (per the patch to 9.6).

The other is to use logical failover, where there's a logical replica that
you can switch leaf nodes to point to. For that we need a way to record
slot advances on one node, send them on the wire and interpret them
usefully on another node. Hence the support outlined above for logical
decoding of logical slot create/drop/update, and an LSN map. I haven't
thought as hard about this one yet.

There's another thing you need for multimaster/mesh systems where there's a
graph, not a simple tree. That's the ability to lazily advance a slot so
that when a node fails you can find the peer that replayed the furthest in
that node's history and ask it to send you the changes from the lost node.
You have to be able to go back in time on the slot to the most recent point
you have a local copy of the other node's state. Turns out that's not hard:
you just delay advancing the slot. You also have to be able to replay it
again with a filter that sends you only that node's changes. That's also
not hard using replication origins. There are some hairy complexities when
it comes to multi-master conflict resolution though, where changes to data
come from more than one node. That's a "for later" problem.
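
To make the "find the furthest-ahead peer" step concrete, it's just a
comparison of replication origin progress across the survivors; roughly
(the origin name 'failed_node' is a placeholder - the actual names are
whatever the replication system registered):

-- Run on each surviving peer: how far did this node replay from the
-- failed node's change stream?
SELECT external_id, remote_lsn
FROM pg_replication_origin_status
WHERE external_id = 'failed_node';

The peer reporting the highest remote_lsn has the most complete copy of the
lost node's changes and is the one the others should re-request them from.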

> I am concerned about testing; I don't think the .sql based regression
> tests are going to adequately test a replication system that supports
> concurrent activity on different databases/servers.

I agree that we can't rely only on that.

This is part of a bigger picture in Pg where we just don't test multi-node
stuff. Failover, replication, etc. are ignored in the tests. The TAP based
stuff looks set to change that, and I suspect we'll have to investigate
whether it's possible to build on top of it for more comprehensive testing.

Thanks again for the review work, I know it takes serious time and effort
and I appreciate it.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
