Replication identifiers, take 3

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Replication identifiers, take 3
Date: 2014-09-23 18:24:22
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


I've previously started two threads about replication identifiers. Check

The've also been discussed in the course of another thread:

As the topic has garnered some heat and confusion I thought it'd be
worthwile to start afresh with an explanation why I think they're

I don't really want to discuss about implementation specifics for now,
but rather about (details of the) concept. Once we've hashed those out,
I'll adapt the existing patch to match them.

There are three primary use cases for replication identifiers:
1) The ability Monitor how for replication has progressed in a
crashsafe manner to allow it to restart at the right point after
2) Efficiently identify the origin of individual changes and
transactions. In multimaster and some cascading scenarios it is
necessary to do so to avoid sending out replayed changes again.
3) Allow to efficiently filter out replayed changes from logical
decoding. It's currently possible to filter changes from inside the
output plugin, but it's much more efficient to filter them out
before decoding.

== Logical Decoding Background ==

To understand the need for 1) it's important to roughly understand how
logical decoding/walsender streams changes and handles feedback from
the receiving side. A walsender performing logical decoding
*continously* sends out transactions. As long as there's new local
changes (i.e. unprocessed WAL) and the network buffers aren't full it
will send changes. *Without* waiting for the client. Everything else
would lead to horrible latency.

Because it sends data without waiting for the client to have processed
them it obviously can't remove resources that are needed to stream
them out again. The client or network connection could crash after

To let the sender know when it can remove resources the receiver
regularly sends back 'feedback' messages acknowledging up to where
changes have been safely received. Whenever such a feedback message
arrives the sender can release resources that aren't needed to decode
the changes below that horizon.

When the receiver ask the server to stream changes out it tells the
sender at which LSN it should start sending changes. All
*transactions* that *commit* after that LSN are sent out. Possibly

== Crashsafe apply ==

Based on those explanations, when building a logical replication
solution on top of logical decoding, one must remember the latest
*remote* LSN that already has been replayed. So that, when the apply
process or the whole database crashes, it is possibly to ask for all
changes since the last transaction that has been successfully applied.

The trivial solution here is to have a table (remote_node,
last_replayed_lsn) and update it for every replayed
transaction. Unfortunately that doesn't perform very well because that
table quickly gets heavily bloated. It's also hard to avoid page level
contention when replaying transaction from multiple remote
nodes. Additionally these changes have to be filtered out when
replicating these changes in a cascading fashion.

To do this more efficiently there needs to be a crashsafe way to
associate the latest successfully replayed remote transaction.

== Identify the origin of changes ==

Say you're building a replication solution that allows two nodes to
insert into the same table on two nodes. Ignoring conflict resolution
and similar fun, one needs to prevent the same change being replayed
over and over. In logical replication the changes to the heap have to
be WAL logged, and thus the *replay* of changes from a remote node
produce WAL which then will be decoded again.

To avoid that it's very useful to tag individual changes/transactions
with their 'origin'. I.e. mark changes that have been directly
triggered by the user sending SQL as originating 'locally' and changes
originating from replaying another node's changes as originating
somewhere else.

If that origin is exposed to logical decoding output plugins they can
easily check whether to stream out the changes/transactions or not.

It is possible to do this by adding extra columns to every table and
store the origin of a row in there, but that a) permanently needs
storage b) makes things much more invasive.

== Proposed solution ==

These two fundamental problems happen to have overlapping

A rather efficient solution for 1) is to attach the 'origin node' and
the remote commit LSN to every local commit record produced by
replay. That allows to have a shared memory "table" (remote_node,
local_lsn, remote_lsn).

During replay that table is kept up2date in sync with transaction
commits. If updated within the transaction commit's critical section
it's guaranteed to be correct, even if transactions can abort due to
constraint violations and such. When the cluster crashes it can be
rebuilt during crash recovery, by updating values whenever a commit
record is read.

The primary complexity here is that the identification of the
'origin' node should be as small as possible to keep the WAL volume

Similarly, to solve the problem of identifying the origin of changes
during decoding, the problem can be solved nicely by adding the origin
node of every change to changes/transactions. At first it might seem
to be sufficient to do so on transaction level, but for cascading
scenarios it's very useful to be able to have changes from different
source transactions combinded into a larger one.

Again the primary problem here is how to efficiently identify origin

== Replication Identifiers ==

The above explains the need to have as small as possible identifiers
for nodes. Two years back I'd suggested that we rely on the user to
manually assign 16bit ids to individual nodes. Not very surprisingly
that was shot down because a) 16bit numbers are not descriptive b) a
per node identifier is problematic because it prohibits replication
inside the same cluster.

So, what I've proposed since is to have two different forms of
identifiers. A long one, that's as descriptive as
$replication_solution wants. And a small one (16bit in my case) that's
*only meaningful within one node*. The long, user facing, identifier
is locally mapped to the short one.

In the first version I proposed these long identifiers had a fixed
form, including the system identifier, timeline id, database id, and a
freeform name. That wasn't well received and I agree that that's too
narrow. I think it should be a freeform text of arbitrary length.

Note that it depends on the replication solution whether these
external identifiers need to be coordinated across systems or not. I
think it's *good* if we don't propose a solution for that - different
replication solutions will have different requirements.

What I've previously suggested (and which works well in BDR) is to add
the internal id to the XLogRecord struct. There's 2 free bytes of
padding that can be used for that purpose.

There's a example of how this can be used from SQL at
That version is built on top of commit timestamps, but that only shows
because pg_replication_identifier_setup_tx_origin() allows to set the
source transaction's timestamp.

With that, far too long, explanation, is it clearer what I think
replication identifiers are for? What's your thougts?


Andres Freund

Andres Freund
PostgreSQL Development, 24x7 Support, Training & Services


Browse pgsql-hackers by date

  From Date Subject
Next Message Gregory Smith 2014-09-23 18:59:48 Re: RLS feature has been committed
Previous Message Fabien COELHO 2014-09-23 18:15:38 Re: add modulo (%) operator to pgbench