Re: logical changeset generation v3 - Source for Slony

From: Steve Singer <steve(at)ssinger(dot)info>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: logical changeset generation v3 - Source for Slony
Date: 2012-11-18 03:50:35
Message-ID: BLU0-SMTP30D7892E90D1FF9E4EA77EDC570@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

First, you can add me to the list of people saying 'wow', I'm impressed.

The approach I am taking to reviewing this to try and answer the
following question

1) How might a future version of slony be able to use logical
replication as described by your patch and design documents
and what would that look like.

2) What functionality is missing from the patch set that would stop me
from implementing or prototyping the above.

Connecting slon to the remote postgresql

Today the slony remote listener thread queries a bunch of events from
sl_event for a batch of SYNC events. Then the remote helper thread
queries data from sl_log_1 and sl_log_2. I see this changing, instead
the slony remote listener thread would connect to the remote system and
get a logical replication stream.

1) Would slony connect as a normal client connection and call
something like 'select pg_slony_process_xlog(...)' to get bunch of
logical replication
change records to process.
2) Would slony connect as a replication connection similar to how the
pg_receivelog program does today and then process the logical changeset
outputs. Instead of writing it to a file (as pg_receivelog does)

It seems that the second approach is what is encouraged. I think we
would put a lot of the pg_receivelog functionality into slon and it
would issue a command like 'INIT_LOGICAL_REPLICATION 'slony') to use the
slony logical replication plugin. Slon would also have to provide
feedback to the walsender about what it has processed so the origin
database knows what catalog snapshots can be expired. Based on
eyeballing pg_receivelog.c it seems that about half the code in the 700
file is related to command line arguments etc, and the other half is
related to looping over the copy out stream, sending feedback and other
things that we would need to duplicate in slon.

pg_receivelog.c has comment:

* We have to use postgres.h not postgres_fe.h here, because there's so
* backend-only stuff in the XLOG include files we need. But we need a
* frontend-ish environment otherwise. Hence this ugly hack.

This looks like more of a carryover from pg_receivexlog.c. From what I
can tell we can eliminate the postgres.h include if we also eliminate
the utils/datetime.h and utils/timestamp.h and instead add in:

#include "postgres_fe.h"
#define POSTGRES_EPOCH_JDATE 2451545
#define UNIX_EPOCH_JDATE 2440588
#define SECS_PER_DAY 86400
#define USECS_PER_SEC INT64CONST(1000000)
typedef int64 XLogRecPtr;
#define InvalidXLogRecPtr 0

If there is a better way of getting these defines someone should speak
up. I recall that in the past slon actually did include postgres.h and
it caused some issues (I think with MSVC win32 builds). Since
pg_receivelog.c will be used as a starting point/sample for third
parties to write client programs it would be better if it didn't
encourage client programs to include postgres.h

The Slony Output Plugin

Once we've modified slon to connect as a logical replication client we
will need to write a slony plugin.

As I understand the plugin API:
* A walsender is processing through WAL records, each time it sees a
COMMIT WAL record it will call my plugins
.change (for each change in the transaction)

* The plugin for a particular stream/replication client will see one
transaction at a time passed to it in commit order. It won't see
.change(t1) followed by .change (t2), followed by a second .change(t1).
The reorder buffer code hides me from all that complexity (yah)

From a slony point of view I think the output of the plugin will be
rows, suitable to be passed to COPY IN of the form:

origin_id, table_namespace,table_name,command_type,

This is basically the Slony 2.2 sl_log format minus a few columns we no
longer need (txid, actionseq).
command_args is a postgresql text array of column=value pairs. Ie [

I don't t think our output plugin will be much more complicated than the
test_decoding plugin. I suspect we will want to give it the ability to
filter out non-replicated tables. We will also have to filter out
change records that didn't originate on the local-node that aren't part
of a cascaded subscription. Remember that in a two node cluster slony
will have connections from A-->B and from B--->A even if user tables
only flow one way. Data that is replicated from A into B will show up in
the WAL stream for B.

Exactly how we do this filtering is an open question, I think the
output plugin will at a minimum need to know:

a) What the slony node id is of the node it is running on. This is easy
to figure out if the output plugin is able/allowed to query its
database. Will this be possible? I would expect to be able to query the
database as it exists now(at plugin invocation time) not as it existed
in the past when the WAL was generated. In addition to the node ID I
can see us wanting to be able to query other slony tables
(sl_table,sl_set etc...)

b) What the slony node id is of the node we are streaming too. It
would be nice if we could pass extra, arbitrary data/parameters to the
output plugins that could include that, or other things. At the moment
the start_logical_replication rule in repl_gram.y doesn't allow for that
but I don't see why we couldn't make it do so.

I still see some open questions about exactly how we would filter out
data in this stage.

<editorial> Everything above deals with the postgresql side of things,
ie the patch in question or the plugin API we would have to work with.
Much of what is below deals with slony side change and might of limited
interest to some on pgsql-hackers

Slon Applying Changes

The next task we will have is to make slon and the replica instance be
able to apply these changes. In slony 2.2 we do a COPY from sl_log and
apply that stream to a table on the replica with COPY. We then have
triggers on the replica that decode the command_args and apply the
changes as
INSERT/UPDATE/DELETE statements on the user tables. I see this
continuing to work in this fashion, but there are a few special cases:

1) Changes made to sl_event on the origin will result in records in the
logical replication stream that change sl_event. In many cases we won't
just be inserting records into sl_event but we will need to instead do
the logic in remote_worker.c for processing the different types of
events. Worst case we could parse the change records we receive from
our version pg_receivellog and split the sl_event records out into a
sl_event stream and a sl_log stream. Another approach might be to have
the slony apply trigger build up a list of events that the slon
remote_worker code can than process through.

2) Slony is normally bi-directional even if user data only replicates
one way. Confirm (sl_confirm) entries go from a replica back to an
origin. In a two node origin->replica scenario for data, the way I see
this working is that the slon for the origin would connect to the
replica (like it does today).
It would receive the logical replication records, but since it isn't
subscribed to any tables it won't receive/process the WAL for
user-tables but it will still receive/process sl_confirm rows. It will
then insert the rows in sl_confirm that it 'replicated' from the remote

With what I have described so far, Slony would then be receiving a
stream of events that look like

t1-insert into foo , [id=1, name='steve']
t1-insert into bar [id=1, something='somethingelse']
t2- insert into foo [....]
t3- insert into sl_event [ev_type=SYNC, ev_origin=1,ev_seqno=12345]

Even though, from a data-correctness point of view, slony could commit
the transaction on the replica after it sees the t1 commit, we won't
want it to do commits other than on a SYNC boundary. This means that
the replicas will continue to move between consistent SYNC snapshots and
that we can still track the state/progress of replication by knowing
what events (SYNC or otherwise) have been confirmed.

This also means that slony should only provide feedback to the
walsender on SYNC boundaries after the transaction has committed on the
receiver. I don't see this as being an issue.

Setting up Subscriptions
At first we have a slon cluster with just 1 node, life is good. When a
second node is created and a path(or pair of paths) are defined between
the nodes I think they will each:
1. Connect to the remote node with a normal libpq connection.
a. Get the current xlog recptr,
b. Query any non-sync events of interest from sl_event.
2. Connect to the remote node with a logical replication connection and
start streaming logical replication changes start at the recptr we retrieved

Slon will then receive any future events from the remote sl_event as
part of the logical replication stream. It won't receive any user
tables because it isn't yet subscribed to any.

When a subscription is started, the SUBSCRIBE_SET and
ENABLE_SUBSCRIPTION events will go through sl_event and the INSERT INTO
sl_event will be part of a change record in the replication stream and
be picked up by the subscribers slon remote_worker.

The remote_worker:copy_set will then need to get a consistent COPY of
the tables in the replication set such that any changes made to the
tables after the copy is started get included in the replication
stream. The approach proposed in the DESIGN.TXT file with exporting a
snapshot sounds okay for this. I *think* slony could get by with
something less fancy as well but it would be ugly.

1. Make sure that the origin starts including change records for the
tables in the set
2. have the slon(copy_set) wait until any transactions on the origin,
that started prior to the ENABLE_SUBSCRIPTION, are committed.
Slony does this today as part of the copy_set logic.
3. Get/remember the snapshot visibility information for the COPY's
4. When we start to process change records we need to filter out
records for transactions that were already visible by the copy.

Steps 1-3 are similar to how slony works today, but step 4 will be a bit
awkward/ugly. This isn't an issue today because we are already using
the transaction visibility information for selecting from sl_log so it
works, but above I had proposed stripping the xid from the logical
change records.

Cascading Replication

The slon for B will insert records from A into B's tables. This insert
will generate WAL records on B. The slon for C should be able to pull
the data it needs (both sl_event entries with ev_origin=A, and user
table data originating on A) from B's logical replication stream. I
don't see any issues here nor do I see a need to 'cache' the data in an
sl_log type of table on B.

Reshaping Replication

In Slony replication is reshaped by two types events, a MOVE SET and a

Move Set:
A replication set might be subscribed in a cascaded fashion like

When a MOVE SET is issued node A will stop accepting new write
transactions for tables in the set. A MOVE_SET(1,A,B) event is then put
into sl_event on node A. Node A will then stop accepting new
transactions on the tables in set 1.
Node B receives the MOVE_SET command in the proper order, after it has
processed the last SYNC generated on A when A was still accepting write
transactions to those tables. When Node B processes the MOVE_SET event
then node B starts accepting write transactions on the tables. Node B
will also generates an ACCEPT_SET event. Node C will then receive the
MOVE SET (ev_origin=A) and the ACCEPT_SET(ev_origin=B) command (after
all SYNC events from A with data changes to the set) and then knows that
it should start data on those tables from B.

I don't see any of this changing with logical replication acting as the
data source.

| .
v .

Today with slony, if B is a valid failover target then it is a
forwarding node of the set. This means that B keeps a record in sl_log
of any changes originating on A until B knows that node C has received
those changes. In the event of a failover, if node C is far behind, it
can just get the missing data from sl_log on node B (the failover
target/new origin).

I see a problem with what I have discussed above, B won't explicitly
store the data from A in sl_log, a cascaded node would depend on B's WAL
The problem is that at FAILOVER time, B might have processed some
changes from A. Node C might also be processing Node B's WAL stream for
events (or data from another set). Node C will discard/not receive the
data for A's tables since it isn't subscribed to those tables from B.
What happens then if at some later point B and C receive the FAILOVER event.
What does node C do? It can't get the missing data from node A because
node A has failed, and it can't get it from node B because node C has
already processed the WAL changes from node B that included the data but
it ignored/discarded it. Maybe node C could reprocess older WAL from
node B? Maybe this forces us to keep an sl_log type structure around?

Is it complete enough to build a prototype?
I think so, the incomplete areas I see are the ones that mentioned in
the patch submission including:
* Snapshot exporting for the initial COPY
* Spilling the reorder buffer to disk

I think it would be possible to build a prototype without those even
though we'd need them before I could build a production system.

I like this design much better than the original design from the spring
that would have required keeping a catalog proxy on the decoding
machine. Based on what I've seen it should be possible to make slony
use logical replication as a source for events instead of triggers
populating sl_log.
My thinking is that we want a way for logreceiver programs to pass
arguments/parameters to the output plugins. Beyond that this looks like
something slony can use.

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Janes 2012-11-18 05:24:02 Re: Do we need so many hint bits?
Previous Message Jeff Davis 2012-11-18 02:20:33 Re: Do we need so many hint bits?