Re: Replication Node Identifiers and crashsafe Apply Progress

From: Steve Singer <steve(at)ssinger(dot)info>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-19 23:49:27
Message-ID: BLU0-SMTP40787261A4CCD7A35F1028DCE70@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/14/2013 12:26 PM, Andres Freund wrote:
> Hello,
>
> As you know, the reason we are working changeset extraction is that we
> want to build logical unidirection and bidirectional replication
> ontop. To use changeset extraction effectively, I think one set of
> related features ontop is very useful:
>
> When extracting changes using the changeset extraction patchset (latest
> version at [1]) the START_LOGICAL_REPLICATION command is used to stream
> changes from a source system. When started it will continue to send
> changes as long as the connection is up or it is aborted. For obvious
> performance reasons it will *not* wait for an ACK for each transaction
> commit it streams out.
> Instead it relies on the receiver, exactly as in physical replication,
> sending feedback messages containing the LSN up to which data has safely
> been received.
> That means frequently something like:
> walsender: => COMMIT 0/10000000
> walsender: => COMMIT 0/10000200
> walsender: => COMMIT 0/10000400
> walsender: => COMMIT 0/10000600
> receiver: <= ACKNOWLEDGE 0/10000270
> walsender: => COMMIT 0/10000800
> is possible and important for performance. I.e. the server has streamed
> out more changes than it got confirmation for.
>
> So, when the the replication connection goes down, e.g. because the
> receiving side has crashed, we need to tell the server from where to
> start. Every position between the last ACKed and the end of WAL is
> legal.
> The receiver then can ask the source to start replication from the last
> replayed commit using START_LOGICAL_REPLICATION 'slot_name'
> '0/10000600' which would then re-stream all the changes in the
> transaction that committe at 0/10000600 and all that follow.
>
> But for that the receiving side needs to know up to where changes have
> been applied. One relatively easy solution for that is that the
> receiving side does something like:
> UPDATE replication_progress SET lsn = '0/10000600' WHERE source_id = ...;
> before the end of every replayed transaction. But that obviously will
> quickly cause bloat.

I don't see how this is going to cause any more bloat than what
trigger-based slony does today with sl_confirm and I don't hear a lot of
complaints about that being a big problem. This might be because slony
doesn't do a commit on the replica for every transaction but groups the
transactions together, logical slony will behave the same way where we
would only commit on SYNC transactions.

> Our solution to that is that a replaying process can tell the backend
> that it is currently doing so and setup three variables for every
> transaction:
> 1) an identifier for the the source database
> 2) the LSN at which the replayed transaction has committed remotely
> 3) the time at which the replayed transaction has committed remotely
>
> When the transaction then commits the commit record will set the
> XACT_CONTAINS_ORIGIN flag to ->xinfo and will add that data to the end
> of the commit record. During crash recovery the startup process will
> remember the newest LSN for each remote database in shared memory.
>
> This way, after a crash, restart, disconnect the replay process can look
> into shared memory and check how far it has already replayed and restart
> seamlessly. With minimal effort.
>
> We previously discussed the topic and some were very adverse to using
> any sort of numeric node identifiers across systems and suggested that
> those should only be used internally. So what the attached patch does is
> to add a new shared system catalog called 'pg_replication_identifier'
> (suggestions for a better name welcome) which translates a number of
> identifying traits into a numeric identifier.
> The set of identifiers currently are:
> * the sysid of the remote system, combined with the remote TLI
> * the oid of the local database
> * the oid of the remote database
> * an optional name
> but that's just what we needed in our multimaster prototype, and not
> what I necessarily think is correct.
>
> The added API (which surely need some work, I am not particularly happy
> with the naming of functions for one) basically consists of two parts:
> 1) functions to query/create replication identifiers:
> * GetReplicationIdentifier(identifying traits) - search for a numeric replication identifier
> * CreateReplicationIdentifier(identifying traits) - creates a numeric replication identifier
> * GetReplicationInfoByIdentifier(numeric identifier) - returns identifying traits
>
> 2) functions to query/manipulate replication progress:
> * AdvanceReplicationIdentifier(node, local_lsn, remote_lsn)
> * XLogRecPtr RemoteCommitFromReplicationIdentifier(node)
>
> Internally the code also maintains some on-disk data which is updated
> during checkpoints to store the replication progress, otherwise it'd
> vanish if we shutdown gracefully ;).
>
> The attached code also integrates with the "commit timestamp" module
> that Alvaro submitted ([2]). Everytime a remote transaction is committed
> we store a) the remote commit's timestamp, b) the origin node id in it.
> That allows to relatively easily build multimaster systems with conflict
> resolution ontop, since whenever there's a conflict the originating
> node, and originating commit timestamp for a row can be queried
> efficiently.
>
> Having information about the origin of a change/transaction allows to
> implement complex replication topologies since the information is
> available to changeset extration output plugins.
> It allows to do write plugins that:
> * decode all changes, independent from the system they were originally
> executed on by the user
> * decode changes generated locally, but none from remote systems
> * pick and choose between those, say only decode those the receiving
> system isn't replicating from itself
>
> Questions are:
> * Which identifying traits do we want to use to identify nodes?
> * How do we want to manipulate replication identifiers? Currently they
> can only be manipulated by using C functions, which is fine for some users,
> but probably not for others?
> * Do we want to allow setting (remote_lsn, remote_timestamp,
> remote_node_id) via SQL? Currently the remote_node_id can be set as a
> GUC, but the other's can't. They probably should be a function that
> can be called instead of GUCs?

A way of advancing the replication pointer via SQL would be nice,
otherwise I'll just have to write my own C function that I will invoke
via SQL (which sin't hard but everyone would need to do the same)

> * Suggestions for better names!
> * Would slony et al need something ontop to use this?

In the slony world we identifer nodes with a 32 bit integer. I think
the idea is that I'm going to have to pass arguments into

+extern RepNodeId GetReplicationIdentifier(uint64 remotesysid, Oid remotetli,
+ Oid remotedb, Name riname,
+ Oid rilocaldb);

to map my slony concept of a node id to a 16 bit "node id" that is only
useful on the local system. In additon to a slony 32 bit node id I have
a conninfo that I can use to contact that node. I think the slon would
need to connect to the remote node with that conninfo (which it does
anyway) and get the remote oid's and then use the slony node_id
converted to a string as the "riname" value.
I would then have to invoke AdvanceReplicationIdentifier at some point
before I issue the commit.

What does building up node_id key from

(sysid, tlid, remote_dbid, local_dbid, name) get us over just mapping from an arbitrary name field to a 16 bit node_id ?
I agree with the other comments on the thread that letting the replication system figure out its own unique naming is better. If we were going t come up with a schema then I am also not sure if using the remote TLI as part of the node key is a good idea.

> Todo:
> * cleanup/naming
> * Set returning function to see the replication progress
> * remove old checkpoint files
>
> Note that this only applies a) ontop the changeset extraction code b)
> the commit timestamp code. The 'replication-identifiers' git branch
> ([3]) contains all integrated together.
>
> Comments welcome!
>
> Greetings,
>
> Andres Freund
>
> [1]http://archives.postgresql.org/message-id/20131114134647.GA26172%40alap2.anarazel.de
> [2]http://archives.postgresql.org/message-id/20131022221600.GE4987%40eldon.alvh.no-ip.org
> [3]http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/replication-identifiers
>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Christophe Pettus 2013-11-19 23:53:36 Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Previous Message Josh Berkus 2013-11-19 23:41:58 Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1