Logical replication existing data copy

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Logical replication existing data copy
Date: 2016-12-19 09:30:07
Message-ID: 3caaecb9-6c1e-7824-4004-97b89b449096@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

as the logical replication patch is progressing closer to integration it
seems to be good time to revisit what to do about preexisting data.

This patch implements data synchronization for the logical replication.
It works both for initial setup of subscription as well as for tables
added later to replication when the subscription is already active.

There is new catalog pg_subscription_rel which track
synchronization/replication state of each table that is subscribed via
publications.

From the user perspective it adds some additional options/clauses to
CREATE/ALTER SUBSCRIPTION namely:
- CREATE SUBSCRIPTION ... [ WITH (COPY DATA | NOCOPY DATA) ] where user
can specify if initial data should be copied or not, with the default COPY.
- ALTER SUBSCRIPTION ... REFRESH PUBLICATION [ WITH (COPY DATA | NOCOPY
DATA) ] which updates the local info about tables in the subscribed
publication again with option to either copy or not copy data.
- Also this WITH (COPY DATA | NOCOPY DATA) ] was added to ALTER
SUBSCRIPTION ... SET PUBLICATION which automatically triggers the
REFRESH PUBLICATION as well.

So existing table data can be copied once subscription is created, but
also new tables can be added and their data copied. This is where the
REFRESH PUBLICATION comes into play. Adding table to publication does
not make it automatically replicated by the subscription as the
subscription does not have tracking info for that table. So to add new
table user must call ALTER SUBSCRIPTION ... REFRESH PUBLICATION on
subscriber otherwise the data won't be replicated.

The copy proccess runs in parallel with normal replication and is
visible in the pg_stat_subscription view. There is also new GUC
max_subscription_sync_workers which specifies how many tables can be
synchronized in parallel.

Implementation-wise this adds modified apply process called sync
(tablesync.c) which at the beginning creates temporary slot, copies data
from it, then synces to the correct LSN with the main apply and exits
(this is actually quite complex but it's explained in the tablesync.c
header).

Standard COPY TO/FROM is used. This is done by enhancing COPY FROM to
accept data from callback function which the tablesync implements.

On the publisher side, the walsender is enhanced to accept standard SQL
(simple query protocol only) so that COPY can be run. This also helps
getting info about table using plain SELECT. The way the SQL works is
that the parser was changed to accept strings that are not valid
replication protocol commands and if such string comes it's sent to
exec_simple_query instead of walsender.

The attached patch applies on top of logical replication patch-set v14.
I will add this as separate entry to the CF.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Logical-replication-support-for-initial-data-copy.patch text/x-diff 141.5 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Beena Emerson 2016-12-19 09:44:50 Re: increasing the default WAL segment size
Previous Message Magnus Hagander 2016-12-19 09:17:47 Re: Slow I/O can break throttling of base backup