Skip site navigation (1) Skip section navigation (2)

Re: logical changeset generation v3 - Source for Slony

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Steve Singer <steve(at)ssinger(dot)info>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: logical changeset generation v3 - Source for Slony
Date: 2012-11-18 16:07:37
Message-ID: 20121118160737.GA5091@awork2.anarazel.de (view raw, whole thread or download thread mbox)
Thread:
Lists: pgsql-hackers
Hi Steve!

On 2012-11-17 22:50:35 -0500, Steve Singer wrote:
> First, you can add me to the list of people saying 'wow', I'm impressed.

Thanks!

> The approach I am taking to reviewing this to try and answer the following
> question
>
> 1) How might a future version of slony be able to use logical replication as
> described by your patch and design documents
> and what would that look like.
>
> 2) What functionality is missing from the patch set that would stop me from
> implementing or prototyping the above.

Sounds like a good plan to me.

>
> Connecting slon to the remote postgresql
> ========================
>
> Today the slony remote listener thread queries a bunch of events from
> sl_event for a batch of SYNC events. Then the remote helper thread queries
> data from sl_log_1 and sl_log_2.    I see this changing, instead the slony
> remote listener thread would connect to the remote system and get a logical
> replication stream.
>
>   1) Would slony connect as a normal client connection and call something
> like 'select pg_slony_process_xlog(...)' to get bunch of logical replication
>       change records to process.
>   OR
>   2) Would slony connect as a replication connection similar to how the
> pg_receivelog program does today and then process the logical changeset
>       outputs.  Instead of writing it to a file (as pg_receivelog does)

It would need to be the latter. We need the feedback messages it sends
for several purposes:
- increasing the lowered xmin
- implementing optionally synchronous replication at some point
- using 1) would mean having transactions open...

> It seems that the second approach is what is encouraged.  I think we would
> put a lot of the pg_receivelog functionality into slon and it would issue a
> command like 'INIT_LOGICAL_REPLICATION 'slony') to use the slony logical
> replication plugin.  Slon would also have to provide feedback to the
> walsender about what it has processed so the origin database knows what
> catalog snapshots can be expired.  Based on eyeballing pg_receivelog.c it
> seems that about half the code in the 700 file is related to command line
> arguments etc, and the other half is related to looping over the copy out
> stream, sending feedback and other things that we would need to duplicate in
> slon.

I think we should provide some glue code to do this, otherwise people
will start replicating all the bugs I hacked into this... More
seriously: I think we should have support code here, no user will want
to learn the intracacies of feedback messages and such. Where that would
live? No idea.

> pg_receivelog.c has  comment:

(its pg_receivellog btw. ;))

>
> /*
>  * We have to use postgres.h not postgres_fe.h here, because there's so much
>  * backend-only stuff in the XLOG include files we need.  But we need a
>  * frontend-ish environment otherwise.    Hence this ugly hack.
>  */
>
> This looks like more of a carryover from pg_receivexlog.c.  From what I can
> tell we can eliminate the postgres.h include if we also eliminate the
> utils/datetime.h and utils/timestamp.h and instead add in:
>
> #include "postgres_fe.h"
> #define POSTGRES_EPOCH_JDATE 2451545
> #define UNIX_EPOCH_JDATE 2440588
> #define SECS_PER_DAY 86400
> #define USECS_PER_SEC INT64CONST(1000000)
> typedef int64 XLogRecPtr;
> #define InvalidXLogRecPtr 0
>
> If there is a better way of getting these defines someone should speak up.
> I recall that in the past slon actually did include postgres.h and it caused
> some issues (I think with MSVC win32 builds).  Since pg_receivelog.c will be
> used as a starting point/sample for third parties to write client programs
> it would be better if it didn't encourage client programs to include
> postgres.h

I wholeheartedly aggree. It should also be cleaned up a fair bit before
others copy it should we not go for having some client side library.

Imo the library could very roughly be something like:

state = SetupStreamingLLog(replication-slot, ...);
while((message = StreamingLLogNextMessage(state))
{
     write(outfd, message->data, message->length);
     if (received_100_messages)
     {
          fsync(outfd);
          StreamingLLogConfirm(message);
     }
}

Although I guess thats not good enough because StreamingLLogNextMessage
would be blocking, but that shouldn't be too hard to work around.


> The Slony Output Plugin
> =====================
>
> Once we've modified slon to connect as a logical replication client we will
> need to write a slony plugin.
>
> As I understand the plugin API:
> * A walsender is processing through WAL records, each time it sees a COMMIT
> WAL record it will call my plugins
> .begin
> .change (for each change in the transaction)
> .commit
>
> * The plugin for a particular stream/replication client will see one
> transaction at a time passed to it in commit order.  It won't see
> .change(t1) followed by .change (t2), followed by a second .change(t1).  The
> reorder buffer code hides me from all that complexity (yah)

Correct.

> From a slony point of view I think the output of the plugin will be rows,
> suitable to be passed to COPY IN of the form:
>
> origin_id, table_namespace,table_name,command_type,
> cmd_updatencols,command_args
>
> This is basically the Slony 2.2 sl_log format minus a few columns we no
> longer need (txid, actionseq).
> command_args is a postgresql text array of column=value pairs.  Ie [
> {id=1},{name='steve'},{project='slony'}]

It seems to me that that makes escaping unneccesarily complicated, but
given you already have all the code... ;)

> I don't t think our output plugin will be much more complicated than the
> test_decoding plugin.

Good. Thats the idea ;). Are you ok with the interface as it is now or
would you like to change something?

> I suspect we will want to give it the ability to
> filter out non-replicated tables.   We will also have to filter out change
> records that didn't originate on the local-node that aren't part of a
> cascaded subscription.  Remember that in a two node cluster  slony will have
> connections from A-->B  and from B--->A even if user tables only flow one
> way. Data that is replicated from A into B will show up in the WAL stream
> for B.

Yes. We will also need something like that. If you remember the first
prototype we sent to the list, it included the concept of an
'origin_node' in wal record. I think you actually reviewed that one ;)

That was exactly aimed at something like this...

Since then my thoughts about how the origin_id looks like have changed a
bit:
- origin id is internally still represented as an uint32/Oid
  - never visible outside of wal/system catalogs
- externally visible it gets
  - assigned an uuid
  - optionally assigned a user defined name
- user settable (permissions?) origin when executing sql:
  - SET change_origin_uuid = 'uuid';
  - SET change_origin_name = 'user-settable-name';
  - defaults to the local node
- decoding callbacks get passed the origin of a change
  - txn->{origin_uuid, origin_name, origin_internal?}
- the init decoding callback can setup an array of interesting origins,
  so the others don't even get the ReorderBuffer treatment

I have to thank the discussion on -hackers and a march through prague
with Marko here...

> Exactly how we do this filtering is an open question,  I think the output
> plugin will at a minimum need to know:
>
> a) What the slony node id is of the node it is running on.  This is easy to
> figure out if the output plugin is able/allowed to query its database.  Will
> this be possible? I would expect to be able to query the database as it
> exists now(at plugin invocation time) not as it existed in the past when the
> WAL was generated.   In addition to the node ID I can see us wanting to be
> able to query other slony tables (sl_table,sl_set etc...)

Hm. There is no fundamental reason not to allow normal database access
to the current database but it won't be all that cheap, so doing it
frequently is not a good idea.
The reason its not cheap is that you basically need to teardown the
postgres internal caches if you switch the timestream in which you are
working.

Would go something like:

TransactionContext = AllocSetCreate(...);
RevertFromDecodingSnapshot();
InvalidateSystemCaches();
StartTransactionCommand();
/* do database work */
CommitTransactionCommand();
/* cleanup memory*/
SetupDecodingSnapshot(snapshot, data);
InvalidateSystemCaches();

Why do you need to be able to query the present? I thought it might be
neccesary to allow additional tables be accessed in a timetraveling
manner, but not this way round.
I guess an initial round of querying during plugin initialization won't
be good enough?

> b) What the slony node id is of the node we are streaming too.   It would be
> nice if we could pass extra, arbitrary data/parameters to the output plugins
> that could include that, or other things.  At the moment the
> start_logical_replication rule in repl_gram.y doesn't allow for that but I
> don't see why we couldn't make it do so.

Yes, I think we want something like that. I even asked input on that
recently ;):
http://archives.postgresql.org/message-id/20121115014250.GA5844@awork2.anarazel.de

Input welcome!


> Even though, from a data-correctness point of view, slony could commit the
> transaction on the replica after it sees the t1 commit, we won't want it to
> do commits other than on a SYNC boundary.  This means that the replicas will
> continue to move between consistent SYNC snapshots and that we can still
> track the state/progress of replication by knowing what events (SYNC or
> otherwise) have been confirmed.

I don't know enough about slony internals, but: why? This will prohibit
you from ever doing (per-transaction) synchronous replication...

> This also means that slony should only provide  feedback to the walsender on
> SYNC boundaries after the transaction has committed on the receiver. I don't
> see this as being an issue.

Yes, thats no problem. You need to give feedback more frequently
(otherwise walsender kicks you off), but you don't have to increase the
confirmed flush location.

> Setting up Subscriptions
> ===================
> At first we have a slon cluster with just 1 node, life is good. When a
> second node is created and a path(or pair of paths) are defined between the
> nodes I think they will each:
> 1. Connect to the remote node with a normal libpq connection.
>     a. Get the current xlog recptr,
>     b. Query any non-sync events of interest from sl_event.
> 2. Connect to the remote node with a logical replication connection and
> start streaming logical replication changes start at the recptr we retrieved
>     above.

Note that INIT_LOGICAL_REPLICATION can take some time to get to the
initial consistent state (especially if there are longrunning
transactions). So you should do the init in 1), query all the events in
the snapshot that returns and then go over to 2).

> The remote_worker:copy_set will then need to get a consistent COPY of the
> tables in the replication set such that any changes made to the tables after
> the copy is started get included in the replication stream.  The approach
> proposed in the DESIGN.TXT file with exporting a snapshot sounds okay for
> this.    I *think* slony could get by with something less fancy as well but
> it would be ugly.

The snapshot exporting isn't really that much additional work as we
already need to support most of it for keeping state across restarts.

> FAILOVER:
> ---------------
> A---->B
> |    .
> v  .
> C
>
> Today with slony, if B is a valid failover target then it is a forwarding
> node of the set.  This means that B keeps a record in sl_log of any changes
> originating on A until B knows that node C has received those changes.  In
> the event of a failover, if node C is far behind, it can just get the
> missing data from sl_log on node B (the failover target/new origin).
>
> I see a problem with what I have discussed above, B won't explicitly store
> the data from A in sl_log, a cascaded node would depend on B's WAL stream.
> The problem is that at FAILOVER time,  B might have processed some changes
> from A. Node  C might also be processing Node B's WAL stream for events (or
> data from another set).  Node C will discard/not receive the data for A's
> tables since it isn't subscribed to those tables from B.  What happens then
> if at some later point B and C receive the FAILOVER event.
> What does node C do? It can't get the missing data from node A because node
> A has failed, and it can't get it from node B because node C has already
> processed the WAL changes from node B that included the data but it
> ignored/discarded it.  Maybe node C could reprocess older WAL from node B?
> Maybe this forces us to keep an sl_log type structure around?

I fear youve left me behind here, sorry, can't give you any input.

> Is it complete enough to build a prototype?
> ==========================
> I think so, the incomplete areas I see are the ones that mentioned in the
> patch submission including:
> * Snapshot exporting for the initial COPY
> * Spilling the reorder buffer to disk
>
> I think it would be possible to build a prototype without those even though
> we'd need them before I could build a production system.

> Conclusions
> =============
> I like this design much better than the original design from the spring that
> would have required keeping a catalog proxy on the decoding machine.  Based
> on what I've seen it should be possible to make slony use logical
> replication as a source for events instead of triggers populating sl_log.
> My thinking is that we want a way for logreceiver programs to pass
> arguments/parameters to the output plugins. Beyond that this looks like
> something slony can use.
>
>

Cool!

Don't hesitate to mention anything that you think would make you life
easier, chances are that youre not the only one who could benefit from
it...


Thanks,

Andres

--
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


In response to

Responses

pgsql-hackers by date

Next:From: Andres FreundDate: 2012-11-18 16:18:35
Subject: Re: [PATCH 05/14] Add a new relmapper.c function RelationMapFilenodeToOid that acts as a reverse of RelationMapOidToFilenode
Previous:From: Erik RijkersDate: 2012-11-18 15:43:38
Subject: 9.3 pg_archivecleanup broken?

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group