Re: Replication identifiers, take 4

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Replication identifiers, take 4
Date: 2015-02-16 00:21:55
Message-ID: 20150216002155.GI15326@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Here's my next attept attempt at producing something we can agree
upon.

The major change that might achieve that is that I've now provided a
separate method to store the origin_id of a node. I've made it
conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both
paths. That new method uses Heikki's xlog rework to dynamically add the
origin to the record if a origin is set up. That works surprisingly
simply.

Other changes:

* Locking preventing several backends to replay changes at the same
time. This is actually overly restrictive in some cases, but I think
good enough for now.
* Logical decoding grew a filter_by_origin callback that allows to
ignore changes that were replayed on a remote system. Such filters are
executed before much is done with records, potentially saving a fair
bit of costs.
* Rebased. That took a bit due the xlog and other changes.
* A couple more SQL interface functions (like dropping a replication
identifier).

I also want to quickly recap replication identifiers, given that
in-person conversations with several people proved that the concept was
slightly misunderstood:

Think about a logical replication solution trying to replay changes. The
postmaster in which the data is replayed into crashes every now and
then. Replication identifiers allow you to do something like:

do_replication()
{
source = ConnectToSourceSystem('mysource');
target = ConnectToSourceSystem('target');

# mark we're replayin
target.exec($$SELECT pg_replication_identifier_setup_replaying_from('myrep_mysource')$$);
# get how far we've replayed last time round
remote_lsn = target.exec($$SELECT remote_lsn FROM pg_get_replication_identifier_progress WHERE external_id = 'myrep_mysource');

# and now replay changes
copystream = source.exec('START_LOGICAL_REPLICATION SLOT ... START %x', remote_lsn);

while (record = copystream.get_record())
{
if (record.type = 'begin')
{
target.exec('BEGIN');
# setup the position of this individual xact
target.exec('SELECT pg_replication_identifier_setup_tx_origin($1, $2);',
record.origin_lsn, record.origin_commit_timestamp);
}
else if (record.type = 'change')
target.exec(record.change_sql)
else if (record.type = 'commit')
target.exec('COMMIT');
}
}

A non pseudocode version of the above would be safe against crashes of
both the source and the target system. If the target system crashes the
replication identifier logic will recover how far we replayed during
crash recovery. If the source system crashes/disconnects we'll have the
current value in memory. Note that this works perfectly well if the
target system (and obviously the source system, but that's obvious) use
synchronous_commit = off - we'll not miss any changes.

Furthermore the fact that the origin of records is recorded allows to
avoid decoding them in logical decoding. That has both efficiency
advantages (we can do so before they are stored in memory/disk) and
functionality advantages. Imagine using a logical replication solution
to replicate inserts to a single table between two databases where
inserts are allowed on both - unless you prevent the replicated inserts
from being replicated again you obviously have a loop. This
infrastructure lets you avoid that.

The SQL interface consists out of:
# manage existance of identifiers
internal_id pg_replication_identifier_create(external_id);
void pg_replication_identifier_drop(external_id);

# replay management
void pg_replication_identifier_setup_replaying_from(external_id);
void pg_replication_identifier_reset_replaying_from();
bool pg_replication_identifier_is_replaying();
void pg_replication_identifier_setup_tx_origin(remote_lsn, remote_commit_time);

# replication progress status view
SELECT * FROM pgreplication_identifier_progress;

# replicatation identifiers
SELECT * FROM pg_replication_identifier;

Petr has developed (for UDR, i.e. logical replication ontop of 9.4) a
SQL reimplementation of replication identifiers and that has proven that
for busier workloads doing a table update to store the replication
progress indeed has a noticeable overhead. Especially if there's some
longer running activity on the standby.

The bigger questions I have are:

1) Where to store the origin. I personally still think that using the
padding is fine. Now that I have proven that it's pretty simple to
store additional information the argument that it might be needed for
something else doesn't really hold anymore. But I can live with the
other solution as well - 3 bytes additional overhead ain't so bad.

2) If we go with the !REPLICATION_IDENTIFIER_REUSE_PADDING solution, do
we want to store the origin only on relevant records? That'd be
XLOG_HEAP_INSERT/XLOG_HEAPMULTI_INSERT/XLOG_HEAP_UPDATE //
XLOG_XACT_COMMIT/XLOG_XACT_COMMIT_PREPARED. I'm thinking of something
like XLogLogOriginIfAvailable() before the emitting log
XLogInsert()s.

3) There should be a lwlock for the individual replication identifier
progress slots.

4) Right now identifier progress is stored during checkpoints in special
files - maybe it'd be better to store them inside the checkpoint
record somehow. We read that even after a clean shutdown, so that
should be fine.

5) I'm think there are issues with a streaming replication standby if
many identifiers are created/dropped. Those shouldn't be too hard to
fix.

6) Obviously the hack in bootstrap.c to get riname marked NOT NULL isn't
acceptable. Either I need to implement boostrap support for marking
varlenas NOT NULL as discussed nearby or replace the syscache lookup
with a index lookup.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2015-02-16 00:24:21 Re: Replication identifiers, take 4
Previous Message Andrew Gierth 2015-02-15 23:50:37 Re: Really bad blowups with hash outer join and nulls