Re: logical decoding of two-phase transactions

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: logical decoding of two-phase transactions
Date: 2017-02-01 02:05:19
Message-ID: CAB7nPqSQPANEwaLm1GU4MYjENrRgetiRCxhWoP-ATQbutWm+Yw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 31, 2017 at 6:22 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> That's where you've misunderstood - it isn't committed yet. The point or
> this change is to allow us to do logical decoding at the PREPARE TRANSACTION
> point. The xact is not yet committed or rolled back.

Yes, I got that. I was looking for a why or an actual use-case.

> Stas wants this for a conflict-free logical semi-synchronous replication
> multi master solution.

This sentence is hard to decrypt, less without "multi master" as the
concept applies basically only to only one master node.

> At PREPARE TRANSACTION time we replay the xact to
> other nodes, each of which applies it and PREPARE TRANSACTION, then replies
> to confirm it has successfully prepared the xact. When all nodes confirm the
> xact is prepared it is safe for the origin node to COMMIT PREPARED. The
> other nodes then see hat the first node has committed and they commit too.

OK, this is the argument I was looking for. So in your schema the
origin node, the one generating the changes, is itself in charge of
deciding if the 2PC should work or not. There are two channels between
the origin node and the replicas replaying the logical changes, one is
for the logical decoder with a receiver, the second one is used to
communicate the WAL apply status. I thought about something like
postgres_fdw doing this job with a transaction that does writes across
several nodes, that's why I got confused about this feature.
Everything goes through one channel, so the failure handling is just
simplified.

> Alternately if any node replies "could not replay xact" or "could not
> prepare xact" the origin node knows to ROLLBACK PREPARED. All the other
> nodes see that and rollback too.

The origin node could just issue the ROLLBACK or COMMIT and the
logical replicas would just apply this change.

> To really make it rock solid you also have to send the old and new values of
> a row, or have row versions, or send old row hashes. Something I also want
> to have, but we can mostly get that already with REPLICA IDENTITY FULL.

On a primary key (or a unique index), the default replica identity is
enough I think.

> It is of interest to me because schema changes in MM logical replication are
> more challenging awkward and restrictive without it. Optimistic conflict
> resolution doesn't work well for schema changes and once the conflicting
> schema changes are committed on different nodes there is no going back. So
> you need your async system to have a global locking model for schema changes
> to stop conflicts arising. Or expect the user not to do anything silly /
> misunderstand anything and know all the relevant system limitations and
> requirements... which we all know works just great in practice. You also
> need a way to ensure that schema changes don't render
> committed-but-not-yet-replayed row changes from other peers nonsensical. The
> safest way is a barrier where all row changes committed on any node before
> committing the schema change on the origin node must be fully replayed on
> every other node, making an async MM system temporarily sync single master
> (and requiring all nodes to be up and reachable). Otherwise you need a way
> to figure out how to conflict-resolve incoming rows with missing columns /
> added columns / changed types / renamed tables etc which is no fun and
> nearly impossible in the general case.

That's one vision of things, FDW-like approaches would be a second,
but those are not able to pass down utility statements natively,
though this stuff can be done with the utility hook.

> I think the purpose of having the GID available to the decoding output
> plugin at PREPARE TRANSACTION time is that it can co-operate with a global
> transaction manager that way. Each node can tell the GTM "I'm ready to
> commit [X]". It is IMO not crucial since you can otherwise use a (node-id,
> xid) tuple, but it'd be nice for coordinating with external systems,
> simplifying inter node chatter, integrating logical deocding into bigger
> systems with external transaction coordinators/arbitrators etc. It seems
> pretty silly _not_ to have it really.

Well, Postgres-XC/XL save the 2PC GID for this purpose in the GTM,
this way the COMMIT/ABORT PREPARED can be issued from any nodes, and
there is a centralized conflict resolution, the latter being done with
a huge cost, causing much bottleneck in scaling performance.

> Personally I don't think lack of access to the GID justifies blocking 2PC
> logical decoding. It can be added separately. But it'd be nice to have
> especially if it's cheap.

I think it should be added reading this thread.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-02-01 02:11:19 Re: Proposal : For Auto-Prewarm.
Previous Message Michael Paquier 2017-02-01 01:26:06 Re: Refactoring of replication commands using printsimple