Re: Replication identifiers, take 3

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 3
Date: 2014-09-26 09:05:53
Message-ID: 20140926090553.GF1169@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2014-09-25 22:44:49 -0400, Robert Haas wrote:
> Thanks for this write-up.
>
> On Tue, Sep 23, 2014 at 2:24 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > 1) The ability Monitor how for replication has progressed in a
> > crashsafe manner to allow it to restart at the right point after
> > errors/crashes.
> > 2) Efficiently identify the origin of individual changes and
> > transactions. In multimaster and some cascading scenarios it is
> > necessary to do so to avoid sending out replayed changes again.
> > 3) Allow to efficiently filter out replayed changes from logical
> > decoding. It's currently possible to filter changes from inside the
> > output plugin, but it's much more efficient to filter them out
> > before decoding.
>
> I agree with these goals.
>
> Let me try to summarize the information requirements for each of these
> things. For #1, you need to know, after crash recovery, for each
> standby, the last commit LSN which the client has confirmed via a
> feedback message.

I'm not sure I understand what you mean here? This is all happening on
the *standby*. The standby needs to know, after crash recovery, the
latest commit LSN from the primary that it has successfully replayed.

Say you replay the following:
SET synchronous_commit = off;

BEGIN;
INSERT INTO foo ...
COMMIT /* original LSN 0/10 */;

BEGIN;
INSERT INTO foo ...
COMMIT /* original LSN 0/20 */;

BEGIN;
INSERT INTO foo ...
COMMIT /* original LSN 0/30 */;

If postgres crashes at any point during this, we need to know whether we
successfully replayed up to 0/10, 0/20 or 0/30. Note that the problem
exists independent of s_c=off, it just excerbates the issue.

Then, after finishing recovery and discovering only 0/10 has persisted,
the standby can reconnect to the primary and do
START_REPLICATION SLOT .. LOGICAL 0/10;
and it'll receive all transactions that have committed since 0/10.

> For #2, you need to know, when decoding each
> change, what the origin node was. And for #3, you need to know, when
> decoding each change, whether it is of local origin. The information
> requirements for #3 are a subset of those for #2.

Right. For #3 it's more important to have the information available
efficiently on individual records.

> > A rather efficient solution for 1) is to attach the 'origin node' and
> > the remote commit LSN to every local commit record produced by
> > replay. That allows to have a shared memory "table" (remote_node,
> > local_lsn, remote_lsn).
>
> This seems OK to me, modulo some arguing about what the origin node
> information ought to look like. People who are not using logical
> replication can use the compact form of the commit record in most
> cases, and people who are using logical replication can pay for it.

Exactly.

> > Similarly, to solve the problem of identifying the origin of changes
> > during decoding, the problem can be solved nicely by adding the origin
> > node of every change to changes/transactions. At first it might seem
> > to be sufficient to do so on transaction level, but for cascading
> > scenarios it's very useful to be able to have changes from different
> > source transactions combinded into a larger one.
>
> I think this is a lot more problematic. I agree that having the data
> in the commit record isn't sufficient here, because for filtering
> purposes (for example) you really want to identify the problematic
> transactions at the beginning, so you can chuck their WAL, rather than
> reassembling the transaction first and then throwing it out. But
> putting the origin ID in every insert/update/delete is pretty
> unappealing from my point of view - not just because it adds bytes to
> WAL, though that's a non-trivial concern, but also because it adds
> complexity - IMHO, a non-trivial amount of complexity. I'd be a lot
> happier with a solution where, say, we have a separate WAL record that
> says "XID 1234 will be decoding for origin 567 until further notice".

I think it actually ends up much simpler than what you propose. In the
apply process, you simply execute
SELECT pg_replication_identifier_setup_replaying_from('bdr: this-is-my-identifier');
or it's C equivalent. That sets a global variable which XLogInsert()
includes the record.
Note that this doesn't actually require any additional space in the WAL
- padding bytes in struct XLogRecord are used to store the
identifier. These have been unused at least since 8.0.

I don't think a solution which logs the change of origin will be
simpler. When the origin is in every record, you can filter without keep
track of any state. That's different if you can switch the origin per
tx. At the very least you need a in memory entry for the origin.

> > == Replication Identifiers ==
> >
> > The above explains the need to have as small as possible identifiers
> > for nodes. Two years back I'd suggested that we rely on the user to
> > manually assign 16bit ids to individual nodes. Not very surprisingly
> > that was shot down because a) 16bit numbers are not descriptive b) a
> > per node identifier is problematic because it prohibits replication
> > inside the same cluster.
> >
> > So, what I've proposed since is to have two different forms of
> > identifiers. A long one, that's as descriptive as
> > $replication_solution wants. And a small one (16bit in my case) that's
> > *only meaningful within one node*. The long, user facing, identifier
> > is locally mapped to the short one.
> >
> > In the first version I proposed these long identifiers had a fixed
> > form, including the system identifier, timeline id, database id, and a
> > freeform name. That wasn't well received and I agree that that's too
> > narrow. I think it should be a freeform text of arbitrary length.
> >
> > Note that it depends on the replication solution whether these
> > external identifiers need to be coordinated across systems or not. I
> > think it's *good* if we don't propose a solution for that - different
> > replication solutions will have different requirements.
>
> I'm pretty fuzzy on how this actually works. Like, the short form
> here is just getting injected into WAL by the apply process. How does
> it figure out what value to inject?

Thy apply process once does
SELECT pg_replication_identifier_setup_replaying_from('bdr: this-is-my-identifier');
that looks up the internal identifier and stores it in a global
variable. That's then filled in struct XLogRecord.
To setup the origin LSN of a transaction
SELECT pg_replication_identifier_setup_tx_origin('0/123456', '2013-12-11 15:14:59.219737+01')
is used. If setup that'll emit the 'extended' commit record with the
remote commit LSN.

> What if it injects a value that doesn't have a short-to-long mapping?

Shouldn't be possible unless you drop a replication identifier after it
has been setup by *_replaying_from(). If we feel that's a actually
dangerous scenario we can prohibit it with a session level lock.

> What's the point of the short-to-long mappings in the first place? Is
> that only required because of the possibility that there might be
> multiple replication solutions in play on the same node?

In my original proposal, 2 years+ back, I only used short numeric
ids. And people didn't like it because it requires coordination between
the replication solutions and possibly between servers. Using a string
identifier like in the above allows to easily build unique names; and
allows every solution to add the information it needs into replication
identifiers.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2014-09-26 09:12:30 Re: RLS feature has been committed
Previous Message Dev Kumkar 2014-09-26 08:06:22 Re: [GENERAL] pg_multixact issues