Re: Replication Node Identifiers and crashsafe Apply Progress

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Steve Singer <steve(at)ssinger(dot)info>
Subject: Re: Replication Node Identifiers and crashsafe Apply Progress
Date: 2013-11-19 12:40:30
Message-ID: CA+TgmoaMAbzJiRoUPWOKYeyk5-615bxw-nrvtubWysVqM6-XCg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 14, 2013 at 12:26 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> As you know, the reason we are working on changeset extraction is that
> we want to build logical unidirectional and bidirectional replication
> on top. To use changeset extraction effectively, I think one set of
> related features on top is very useful:
>
> When extracting changes using the changeset extraction patchset (latest
> version at [1]) the START_LOGICAL_REPLICATION command is used to stream
> changes from a source system. Once started it will continue to send
> changes until the connection goes down or it is aborted. For obvious
> performance reasons it will *not* wait for an ACK for each transaction
> commit it streams out.
> Instead it relies on the receiver, exactly as in physical replication,
> sending feedback messages containing the LSN up to which data has safely
> been received.
> That means that, frequently, something like:
> walsender: => COMMIT 0/10000000
> walsender: => COMMIT 0/10000200
> walsender: => COMMIT 0/10000400
> walsender: => COMMIT 0/10000600
> receiver: <= ACKNOWLEDGE 0/10000270
> walsender: => COMMIT 0/10000800
> is possible and important for performance. I.e., the server has streamed
> out more changes than it has received confirmation for.
>
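
For context, the feedback message in physical replication is a "standby
status update" sent inside a CopyData packet; its layout is roughly the
following (a sketch only, with the actual serialization -- network byte
order inside the CopyData payload -- elided):

    typedef struct StandbyStatusUpdate
    {
        char    msgtype;        /* 'r' */
        uint64  writePtr;       /* last WAL location received and written */
        uint64  flushPtr;       /* last WAL location flushed to disk */
        uint64  applyPtr;       /* last WAL location applied */
        int64   sendTime;       /* receiver's clock, in microseconds */
        char    replyRequested; /* ask the sender for an immediate reply? */
    } StandbyStatusUpdate;
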
> So, when the replication connection goes down, e.g. because the
> receiving side has crashed, we need to tell the server from where to
> start. Every position between the last ACKed and the end of WAL is
> legal.
> The receiver can then ask the source to start replication from the
> last replayed commit using START_LOGICAL_REPLICATION 'slot_name'
> '0/10000600', which would then re-stream all the changes in the
> transaction that committed at 0/10000600 and all that follow.
>
> But for that the receiving side needs to know up to where changes have
> been applied. One relatively easy solution for that is that the
> receiving side does something like:
> UPDATE replication_progress SET lsn = '0/10000600' WHERE source_id = ...;
> before the end of every replayed transaction. But that obviously will
> quickly cause bloat.
>
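
For concreteness, the naive approach bundles the progress update into the
same transaction as the replayed changes, so that one commit makes both
durable together. A minimal libpq sketch, where "replication_progress"
and the source id are the hypothetical names from the example above and
error handling is elided:

    #include <libpq-fe.h>

    static void
    apply_one_transaction(PGconn *conn, const char *changes_sql,
                          const char *remote_commit_lsn)
    {
        const char *params[1] = { remote_commit_lsn };

        PQclear(PQexec(conn, "BEGIN"));
        PQclear(PQexec(conn, changes_sql));    /* replay the remote changes */

        /* record how far we got, in the same transaction */
        PQclear(PQexecParams(conn,
                             "UPDATE replication_progress SET lsn = $1 "
                             "WHERE source_id = 17",   /* hypothetical id */
                             1, NULL, params, NULL, NULL, 0));

        PQclear(PQexec(conn, "COMMIT"));       /* one fsync covers both */
    }

Since every replayed transaction rewrites the same row, the dead-tuple
churn is what causes the bloat complained about above.
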
> Our solution to that is that a replaying process can tell the backend
> that it is currently doing so and set up three variables for every
> transaction:
> 1) an identifier for the source database
> 2) the LSN at which the replayed transaction has committed remotely
> 3) the time at which the replayed transaction has committed remotely
>
> When the transaction then commits, the commit record will set the
> XACT_CONTAINS_ORIGIN flag in ->xinfo and will add that data to the end
> of the commit record. During crash recovery the startup process will
> remember the newest LSN for each remote database in shared memory.
>
> This way, after a crash, restart, or disconnect, the replay process can
> look into shared memory, check how far it has already replayed, and
> restart seamlessly, with minimal effort.
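
To make that concrete, here is a rough sketch of the mechanism as
described above; the names are illustrative rather than the patch's
actual ones, and normal backend context (postgres.h, access/xlogdefs.h,
and so on) is assumed:

    typedef uint16 RepNodeId;

    /* set up by the apply process before each replayed transaction */
    typedef struct ReplicationOrigin
    {
        RepNodeId   node_id;        /* identifier for the source database */
        XLogRecPtr  remote_lsn;     /* LSN of the remote commit */
        TimestampTz remote_commit;  /* time of the remote commit */
    } ReplicationOrigin;

    /*
     * Commit sets XACT_CONTAINS_ORIGIN in the record's ->xinfo and appends
     * a ReplicationOrigin.  During redo the startup process keeps the
     * newest remote LSN per origin in shared memory:
     */
    XLogRecPtr  replay_progress[MAX_REPLICATION_ORIGINS];  /* in shmem */

    static void
    origin_redo(const ReplicationOrigin *origin)
    {
        if (origin->remote_lsn > replay_progress[origin->node_id])
            replay_progress[origin->node_id] = origin->remote_lsn;
    }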

It would be much less invasive for the replication apply code to fsync
its own state on the apply side, but that would double the fsync rate,
which is not appealing. Still, that's a useful way to frame what you're
aiming to accomplish here: applying remote transactions crash-safely
without doubling the fsync rate.

Although I agree that we need a way to do that, I don't have a
particularly warm and fuzzy feeling about this particular proposal:
there are too many bits of it that feel like entirely arbitrary design
decisions. If we're going to build a full-fledged logical replication
solution into core, attempting to obsolete Slony and Bucardo and
Londiste and everything that's out there, then I think we have a great
deal of design work that we have to do before we start committing
things, or even finalizing designs. If we're going to continue with
the philosophy of building a toolkit that can serve as a building
block for multiple solutions, then color me unconvinced that this will
do the job.

If we made the xlog system truly extensible, that seems like it'd
punch your ticket here. I'm not sure how practical that is, though.
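
Purely hypothetically, "extensible xlog" might mean letting a module
register its own resource manager, whose records the startup process
would dispatch during recovery. No such API exists; this is only to
illustrate the shape of the idea:

    typedef struct CustomRmgr
    {
        const char *name;
        void      (*redo) (XLogRecord *record);  /* replay one record */
        void      (*desc) (StringInfo buf, XLogRecord *record);
    } CustomRmgr;

    /* hypothetical: claim an rmgr id and install redo/desc callbacks */
    extern void RegisterCustomRmgr(uint8 rmid, const CustomRmgr *rmgr);

With something like that, the apply-progress tracking described upthread
could live in an extension rather than in core.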

> We previously discussed the topic and some were very averse to using
> any sort of numeric node identifiers across systems and suggested that
> those should only be used internally. So what the attached patch does is
> to add a new shared system catalog called 'pg_replication_identifier'
> (suggestions for a better name welcome) which translates a number of
> identifying traits into a numeric identifier.
> The current set of identifying traits is:
> * the sysid of the remote system, combined with the remote TLI
> * the oid of the local database
> * the oid of the remote database
> * an optional name
> but that's just what we needed in our multimaster prototype, and not
> what I necessarily think is correct.
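
Going by that description, the catalog row would map roughly these traits
to a small numeric ID; the field names here are invented for
illustration, and RepNodeId is as sketched above:

    typedef struct ReplicationIdentifier
    {
        RepNodeId   riident;        /* the local, numeric identifier */
        uint64      riremotesysid;  /* sysid of the remote system */
        TimeLineID  riremotetli;    /* remote timeline */
        Oid         rilocaldb;      /* oid of the local database */
        Oid         riremotedb;     /* oid of the remote database */
        const char *riname;         /* optional name */
    } ReplicationIdentifier;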

The fact that you've included both local and remote database OIDs
seems wrong; shouldn't the replication identifier only serve to
identify the source node, not the replication stream? What if you
want to replicate from table A to table B within the same database?
(e.g. so that you can lock them for DDL changes in alternation) What
if you want to replicate tables A and B in one database into a
database on another node, but make the two of them independent
replication streams - e.g. because we know that transactions on table
A will be large but not latency-sensitive, while transactions on B
will be small but highly sensitive to replication delay? What if we want
to use changeset extraction to produce delta relations on the base
tables for a materialized view, feed them through a bunch of
relational algebra, and apply the resulting changes to the MV, keeping
track of how far we got?

We need some kind of pretty flexible system here, if we're not to box
ourselves into a corner.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
