Skip site navigation (1) Skip section navigation (2)

Re: logical changeset generation v3 - Source for Slony

From: Steve Singer <steve(at)ssinger(dot)info>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: logical changeset generation v3 - Source for Slony
Date: 2012-11-18 03:50:35
Message-ID: BLU0-SMTP30D7892E90D1FF9E4EA77EDC570@phx.gbl (view raw, whole thread or download thread mbox)
Lists: pgsql-hackers
First, you can add me to the list of people saying 'wow', I'm impressed.

The approach I am taking to reviewing this to try and answer the 
following question

1) How might a future version of slony be able to use logical 
replication as described by your patch and design documents
and what would that look like.

2) What functionality is missing from the patch set that would stop me 
from implementing or prototyping the above.

Connecting slon to the remote postgresql

Today the slony remote listener thread queries a bunch of events from 
sl_event for a batch of SYNC events. Then the remote helper thread 
queries data from sl_log_1 and sl_log_2.    I see this changing, instead 
the slony remote listener thread would connect to the remote system and 
get a logical replication stream.

   1) Would slony connect as a normal client connection and call 
something like 'select pg_slony_process_xlog(...)' to get bunch of 
logical replication
       change records to process.
   2) Would slony connect as a replication connection similar to how the 
pg_receivelog program does today and then process the logical changeset
       outputs.  Instead of writing it to a file (as pg_receivelog does)

It seems that the second approach is what is encouraged.  I think we 
would put a lot of the pg_receivelog functionality into slon and it 
would issue a command like 'INIT_LOGICAL_REPLICATION 'slony') to use the 
slony logical replication plugin.  Slon would also have to provide 
feedback to the walsender about what it has processed so the origin 
database knows what catalog snapshots can be expired.  Based on 
eyeballing pg_receivelog.c it seems that about half the code in the 700 
file is related to command line arguments etc, and the other half is 
related to looping over the copy out stream, sending feedback and other 
things that we would need to duplicate in slon.

pg_receivelog.c has  comment:

  * We have to use postgres.h not postgres_fe.h here, because there's so 
  * backend-only stuff in the XLOG include files we need.  But we need a
  * frontend-ish environment otherwise.    Hence this ugly hack.

This looks like more of a carryover from pg_receivexlog.c.  From what I 
can tell we can eliminate the postgres.h include if we also eliminate 
the utils/datetime.h and utils/timestamp.h and instead add in:

#include "postgres_fe.h"
#define POSTGRES_EPOCH_JDATE 2451545
#define UNIX_EPOCH_JDATE 2440588
#define SECS_PER_DAY 86400
#define USECS_PER_SEC INT64CONST(1000000)
typedef int64 XLogRecPtr;
#define InvalidXLogRecPtr 0

If there is a better way of getting these defines someone should speak 
up.   I recall that in the past slon actually did include postgres.h and 
it caused some issues (I think with MSVC win32 builds).  Since 
pg_receivelog.c will be used as a starting point/sample for third 
parties to write client programs it would be better if it didn't 
encourage client programs to include postgres.h

The Slony Output Plugin

Once we've modified slon to connect as a logical replication client we 
will need to write a slony plugin.

As I understand the plugin API:
* A walsender is processing through WAL records, each time it sees a 
COMMIT WAL record it will call my plugins
.change (for each change in the transaction)

* The plugin for a particular stream/replication client will see one 
transaction at a time passed to it in commit order.  It won't see 
.change(t1) followed by .change (t2), followed by a second .change(t1).  
The reorder buffer code hides me from all that complexity (yah)

 From a slony point of view I think the output of the plugin will be 
rows, suitable to be passed to COPY IN of the form:

origin_id, table_namespace,table_name,command_type, 

This is basically the Slony 2.2 sl_log format minus a few columns we no 
longer need (txid, actionseq).
command_args is a postgresql text array of column=value pairs.  Ie [ 

I don't t think our output plugin will be much more complicated than the 
test_decoding plugin.  I suspect we will want to give it the ability to 
filter out non-replicated tables.   We will also have to filter out 
change records that didn't originate on the local-node that aren't part 
of a cascaded subscription.  Remember that in a two node cluster  slony 
will have connections from A-->B  and from B--->A even if user tables 
only flow one way. Data that is replicated from A into B will show up in 
the WAL stream for B.

Exactly how we do this filtering is an open question,  I think the 
output plugin will at a minimum need to know:

a) What the slony node id is of the node it is running on.  This is easy 
to figure out if the output plugin is able/allowed to query its 
database.  Will this be possible? I would expect to be able to query the 
database as it exists now(at plugin invocation time) not as it existed 
in the past when the WAL was generated.   In addition to the node ID I 
can see us wanting to be able to query other slony tables 
(sl_table,sl_set etc...)

b) What the slony node id is of the node we are streaming too.   It 
would be nice if we could pass extra, arbitrary data/parameters to the 
output plugins that could include that, or other things.  At the moment 
the start_logical_replication rule in repl_gram.y doesn't allow for that 
but I don't see why we couldn't make it do so.

I still see some open questions about exactly how we would filter out 
data in this stage.

<editorial> Everything above deals with the postgresql side of things, 
ie the patch in question or the plugin API we would  have to work with.
Much of what is below deals with slony side change and might of limited 
interest to some on pgsql-hackers

Slon Applying Changes

The next task we will have is to make slon and the replica instance be 
able to apply these changes.  In slony 2.2 we do a COPY from sl_log and 
apply that stream to a table on the replica with COPY. We then have 
triggers on the replica that decode the command_args and apply the 
changes as
INSERT/UPDATE/DELETE statements on the user tables.   I see this 
continuing to work in this fashion, but there are a few special cases:

1) Changes made to sl_event on the origin will result in records in the 
logical replication stream that change sl_event.  In many cases we won't
just be inserting records into sl_event but we will need to instead do 
the logic in remote_worker.c for processing the different types of
events.  Worst case we could parse the change records we receive from 
our version pg_receivellog and split the sl_event records out into a 
sl_event stream and a sl_log stream.  Another approach might be to have 
the slony apply trigger build up a list of events that the slon 
remote_worker code can than process through.

2) Slony is normally bi-directional even if user data only replicates 
one way.  Confirm (sl_confirm) entries go from a replica back to an 
origin.  In a two node origin->replica scenario for data, the way I see 
this working is that the slon for the origin would connect to the 
replica (like it does today).
It would receive the logical replication records, but since it isn't 
subscribed to any tables it won't receive/process the WAL for 
user-tables but it will still receive/process sl_confirm rows.   It will 
then insert the rows in sl_confirm that it 'replicated' from the remote 

With what I have described so far, Slony would then be receiving a 
stream of events that look like

t1-insert into foo , [id=1, name='steve']
t1-insert into bar  [id=1, something='somethingelse']
t2- insert into foo [....]
t3- insert into sl_event [ev_type=SYNC, ev_origin=1,ev_seqno=12345]

Even though, from a data-correctness point of view, slony could commit 
the transaction on the replica after it sees the t1 commit, we won't 
want it to do commits other than on a SYNC boundary.  This means that 
the replicas will continue to move between consistent SYNC snapshots and 
that we can still track the state/progress of replication by knowing 
what events (SYNC or otherwise) have been confirmed.

This also means that slony should only provide  feedback to the 
walsender on SYNC boundaries after the transaction has committed on the 
receiver. I don't see this as being an issue.

Setting up Subscriptions
At first we have a slon cluster with just 1 node, life is good. When a 
second node is created and a path(or pair of paths) are defined between 
the nodes I think they will each:
1. Connect to the remote node with a normal libpq connection.
     a. Get the current xlog recptr,
     b. Query any non-sync events of interest from sl_event.
2. Connect to the remote node with a logical replication connection and 
start streaming logical replication changes start at the recptr we retrieved

Slon will then receive any future events from the remote sl_event as 
part of the logical replication stream.  It won't receive any user 
tables because it isn't yet subscribed to any.

When a subscription is started,  the SUBSCRIBE_SET and 
ENABLE_SUBSCRIPTION events will go through sl_event and the INSERT INTO 
sl_event will be part of a change record in the replication stream and 
be picked up by the subscribers slon remote_worker.

The remote_worker:copy_set will then need to get a consistent COPY of 
the tables in the replication set such that any changes made to the 
tables after the copy is started get included in the replication 
stream.  The approach proposed in the DESIGN.TXT file with exporting a 
snapshot sounds okay for this.    I *think* slony could get by with 
something less fancy as well but it would be ugly.

1. Make sure that the origin starts including change records for the 
tables in the set
2. have the slon(copy_set) wait until any transactions on the origin, 
that started prior to the ENABLE_SUBSCRIPTION, are committed.
      Slony does this today as part of the copy_set logic.
3.  Get/remember the snapshot visibility information for the COPY's 
4.  When we start to process change records we need to filter out 
records for transactions that were already visible by the copy.

Steps 1-3 are similar to how slony works today, but step 4 will be a bit 
awkward/ugly.  This isn't an issue today because we are already using 
the transaction visibility information for selecting from sl_log so it 
works, but above I had proposed stripping the xid from the logical 
change records.

Cascading Replication

The slon for B will insert records from A into B's tables.  This insert 
will generate WAL records on B.  The slon for C should be able to pull 
the data it needs  (both sl_event entries with ev_origin=A, and user 
table data originating on A) from B's logical replication stream.  I 
don't see any issues here nor do I see a need to 'cache' the data in an 
sl_log type of table on B.

Reshaping Replication

In Slony replication is reshaped by two types events, a MOVE SET and a 

Move Set:
A replication set might be subscribed in a cascaded fashion like

When a MOVE SET is issued node A will stop accepting new write 
transactions for tables in the set.  A MOVE_SET(1,A,B) event is then put 
into sl_event on node A. Node A will then stop accepting new 
transactions on the tables in set 1.
Node B receives the MOVE_SET command in the proper order, after it has 
processed the last SYNC generated on A when A was still accepting write 
transactions to those tables.  When Node B processes the MOVE_SET event 
then node B starts accepting write transactions on the tables.  Node B 
will also generates an ACCEPT_SET event. Node C will then receive the 
MOVE SET (ev_origin=A) and the ACCEPT_SET(ev_origin=B) command (after 
all SYNC events from A with data changes to the set) and then knows that 
it should start data on those tables from B.

I don't see any of this changing with logical replication acting as the 
data source.

|    .
v  .

Today with slony, if B is a valid failover target then it is a 
forwarding node of the set.  This means that B keeps a record in sl_log 
of any changes originating on A until B knows that node C has received 
those changes.  In the event of a failover, if node C is far behind, it 
can just get the missing data from sl_log on node B (the failover 
target/new origin).

I see a problem with what I have discussed above, B won't explicitly 
store the data from A in sl_log, a cascaded node would depend on B's WAL 
The problem is that at FAILOVER time,  B might have processed some 
changes from A. Node  C might also be processing Node B's WAL stream for 
events (or data from another set).  Node C will discard/not receive the 
data for A's tables since it isn't subscribed to those tables from B.  
What happens then if at some later point B and C receive the FAILOVER event.
What does node C do? It can't get the missing data from node A because 
node A has failed, and it can't get it from node B because node C has 
already processed the WAL changes from node B that included the data but 
it ignored/discarded it.  Maybe node C could reprocess older WAL from 
node B? Maybe this forces us to keep an sl_log type structure around?

Is it complete enough to build a prototype?
I think so, the incomplete areas I see are the ones that mentioned in 
the patch submission including:
* Snapshot exporting for the initial COPY
* Spilling the reorder buffer to disk

I think it would be possible to build a prototype without those even 
though we'd need them before I could build a production system.

I like this design much better than the original design from the spring 
that would have required keeping a catalog proxy on the decoding 
machine.  Based on what I've seen it should be possible to make slony 
use logical replication as a source for events instead of triggers 
populating sl_log.
My thinking is that we want a way for logreceiver programs to pass 
arguments/parameters to the output plugins. Beyond that this looks like 
something slony can use.

In response to


pgsql-hackers by date

Next:From: Jeff JanesDate: 2012-11-18 05:24:02
Subject: Re: Do we need so many hint bits?
Previous:From: Jeff DavisDate: 2012-11-18 02:20:33
Subject: Re: Do we need so many hint bits?

Privacy Policy | About PostgreSQL
Copyright © 1996-2018 The PostgreSQL Global Development Group