RE: Initial Schema Sync for Logical Replication

From: "Kumar, Sachin" <ssetiya(at)amazon(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Initial Schema Sync for Logical Replication
Date: 2023-04-03 06:53:56
Message-ID: f9ac837b4dec43548162f1ec24880d06@amazon.com
Lists: pgsql-hackers

> -----Original Message-----
> From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
> > > I was thinking each TableSync process will call pg_dump --table,
> > > This way if we have N tableSync process, we can have N
> > > pg_dump --table=table_name called in parallel.
> > > In fact we can use --schema-only to get schema and then let COPY
> > > take care of data syncing. We will use same snapshot for pg_dump
> > > as well as COPY table.
> >
> > How can we postpone creating the pg_subscription_rel entries until the
> > tablesync worker starts and does the schema sync? I think that since
> > pg_subscription_rel entry needs the table OID, we need either to do
> > the schema sync before creating the entry (i.e, during CREATE
> > SUBSCRIPTION) or to postpone creating entries as Amit proposed[1]. The
> > apply worker needs the information of tables to sync in order to
> > launch the tablesync workers, but it needs to create the table schema
> > to get that information.
>
> For the above reason, I think that step 6 of the initial proposal won't work.
>
> If we can have the tablesync worker create an entry of pg_subscription_rel after
> creating the table, it may give us the flexibility to perform the initial sync. One
> idea is that we add a relname field to pg_subscription_rel so that we can create
> entries with relname instead of OID if the table is not created yet. Once the
> table is created, we clear the relname field and set the OID of the table instead.
> It's not an ideal solution but we might make it simpler later.
>
> Assuming that it's feasible, I'm considering another approach for the initial sync
> in order to address the concurrent DDLs.
>
> The basic idea is to somewhat follow how pg_dump/restore to dump/restore
> the database data. We divide the synchronization phase (including both schema
> and data) up into three phases: pre-data, table-data, post-data. These mostly
> follow the --section option of pg_dump.
>
> 1. The backend process performing CREATE SUBSCRIPTION creates the
> subscription but doesn't create pg_subscription_rel entries yet.
>
> 2. Before starting the streaming, the apply worker fetches the table list from the
> publisher, create pg_subscription_rel entries for them, and dumps+restores
> database objects that tables could depend on but don't depend on tables such as
> TYPE, OPERATOR, ACCESS METHOD etc (i.e.
> pre-data).

We will not have the slot's starting snapshot at that point, so somehow we would have to
take a new snapshot and then skip all the WAL between the slot's starting point and the
LSN at which that snapshot was created?
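
For reference, this is roughly what a slot's own starting snapshot buys us today: the
CREATE_REPLICATION_SLOT command returns a consistent_point and can export a snapshot taken
exactly at that point, so data read under the snapshot and WAL streamed from
consistent_point line up without skipping anything. (Sketch only; the slot name is a
placeholder.)

    -- On a "replication=database" connection to the publisher:
    CREATE_REPLICATION_SLOT "pg_16394_sync_slot" TEMPORARY LOGICAL pgoutput
        (SNAPSHOT 'export');
    -- returns: slot_name | consistent_point | snapshot_name | output_plugin
    -- the exported snapshot corresponds exactly to consistent_point, so a dump
    -- taken under it does not overlap with the WAL streamed from that LSN.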

>
> 3. The apply worker launches the tablesync workers for tables that need to be
> synchronized.
>
> There might be DDLs executed on the publisher for tables before the tablesync
> worker starts. But the apply worker needs to apply DDLs for pre-data database
> objects. OTOH, it can ignore DDLs for not-synced-yet tables and other database
> objects such as INDEX, TRIGGER, RULE, etc (i.e. post-data).
>
> 4. The tablesync worker creates its replication slot, dumps+restores the table
> schema, update the pg_subscription_rel, and perform COPY.
>
> These operations should be done in the same transaction.

pg_restore won't be rollbackable as part of the tablesync worker's transaction, so we would
need to maintain intermediate states in pg_subscription_rel, as sketched below.
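
To be concrete, something along these lines, reusing the existing srsubstate machinery (any
extra state such as "schema restored but COPY not done" would be new and does not exist
today):

    -- Existing per-table states in pg_subscription_rel.srsubstate:
    --   'i' init, 'd' data is being copied, 'f' finished table copy,
    --   's' synchronized, 'r' ready
    SELECT srrelid::regclass AS table_name, srsubstate
      FROM pg_subscription_rel
     WHERE srsubstate <> 'r';  -- tables a restarted worker still has to handle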

>
> 5. After finishing COPY, the tablesync worker dumps indexes (and perhaps
> constraints) of the table and creates them (which possibly takes a long time).
> Then it starts to catch up, same as today. The apply worker needs to wait for the
> tablesync worker to catch up.

I don't think we can have the CATCHUP stage. There can be a DDL on the publisher that adds
a new column (and this DDL will only be executed by the apply worker later). If we then get
an INSERT that uses the new column while we still have the old definition of the table, the
insert will fail.
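
For example (table and column names are made up):

    -- On the publisher, while the tablesync worker for t1 is catching up:
    ALTER TABLE t1 ADD COLUMN c2 int;  -- DDL only the apply worker replays later
    INSERT INTO t1 (c1, c2) VALUES (1, 1);
    -- The tablesync worker receives this INSERT during catchup, but the
    -- subscriber's t1 still has only c1, so applying the change fails.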

>
> We need to repeat these steps until we complete the initial data copy and create
> indexes for all tables, IOW until all pg_subscription_rel status becomes READY.
>
> 6. If the apply worker confirms all tables are READY, it starts another sync
> worker who is responsible for the post-data database objects such as TRIGGER,
> RULE, POLICY etc (i.e. post-data).
>
> While the sync worker is starting up or working, the apply worker applies
> changes for pre-data database objects as well as READY tables.
We might have an issue if there is a table created like
CREATE TABLE table_name AS SELECT * FROM materialized_view;
>
> 7. Similar to the tablesync worker, this sync worker creates its replication slot
> and sets the returned LSN somewhere, say pg_subscription.
>
> 8. The sync worker dumps and restores these objects. Which could take a time
> since it would need to create FK constraints. Then it starts to catch up if the
> apply worker is ahead. The apply worker waits for the sync worker to catch up.
>
> 9. Once the sync worker catches up, the apply worker starts applying changes
> for all database objects.
>
> IIUC with this approach, we can resolve the concurrent DDL problem Sachin
> mentioned, and indexes (and constraints) are created after the initial data copy.
>
> The procedures are still very complex and not fully considered yet but I hope
> there are some useful things at least for discussion.
>
> Probably we can start with supporting only tables. In this case, we would not
> need the post-data phase (i.e. step 6-9). It seems to me that we need to have
> the subscription state somewhere (maybe
> pg_subscription) so that the apply worker figure out the next step.
> Since we need to dump and restore different objects on different timings, we
> probably cannot directly use pg_dump/pg_restore. I've not considered how the
> concurrent REFRESH PUBLICATION works.

I think the above prototype will work and will have the least amount of side effects, but
it might be too complex to implement and I am not sure about the corner cases.

I was thinking of other ways of doing the initial sync which are less complex, but each
comes with its own set of bottlenecks.

On the publisher side:
1) Locking the publisher: the easiest one to implement. The applier process takes an
ACCESS SHARE lock on all the published tables (we don't have to worry about newly created
concurrent tables). As each tableSync worker finishes syncing its table, the corresponding
lock is released, so the table locks are released in steps; see the sketch below. Users
can still perform DML on the tables, but DDL won't be allowed.
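
A minimal sketch of what I mean, assuming one publisher transaction per table so each lock
can be dropped on its own (ordinary table locks are only released at transaction end; the
table name is an example):

    BEGIN;
    LOCK TABLE public.t1 IN ACCESS SHARE MODE;
        -- conflicts with ACCESS EXCLUSIVE (most DDL), but not with DML
    -- ... tableSync copies public.t1 ...
    COMMIT;  -- releases the lock once this table is synced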

On the subscriber side:
The main issue is that the tableSync process can see a future version of the table
(data/schema) with respect to the applier process, so we have to find a way to ensure that
the tableSync and applier processes see the same table version.

2) Using pg_dump/pg_restore for schema and data: as Amit mentioned, we can use
pg_dump/pg_restore [1]. Although it has the side effect of temporarily using double the
storage, we can take a pg_dump of each table separately and delete the dump as soon as that
table is synced. The tableSync process will read the dump and call pg_restore for its
table. If we crash in the middle of restoring the tables, we can run pg_dump (--clean)/
restore again for the remaining tables. With this we can reduce the space usage, but we
might create too many files.

3) Using a publisher snapshot: the applier process will do pg_dump/pg_restore as usual.
Then the applier process will start a new process P1 which will connect to the publisher,
start a transaction, and call pg_export_snapshot() to export the snapshot. The applier
process will then take the snapshot string and pass it to each tableSync process as an
argument, and tableSync will use this snapshot for its COPY TABLE. tableSync should only do
COPY TABLE and then exit, so we won't do any catchup phase in tableSync. After all tables
finish COPY, the transaction will be committed by the P1 process and it will exit. In the
case of a crash/restart we can simply start from the beginning, since nothing is committed
until every table is synced. There are 2 main issues with this approach:
1. I am not sure what side effects we might have on the publisher, since we might have to
keep the transaction open for a long time.
2. The applier process will simply wait until all tables are synced.
Since the applier process won't be able to apply any WAL until all tables are synced, maybe
instead of creating a new process the applier process itself can start the transaction and
export the snapshot, and the tableSync processes will use that snapshot. After all tables
are synced it can start WAL streaming. A sketch of the snapshot hand-off is below.
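
Roughly the SQL-level hand-off I have in mind (the snapshot string and table name are just
examples):

    -- P1 (or the applier itself), on its publisher connection:
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT pg_export_snapshot();              -- returns e.g. '00000003-000001B2-1'
    -- keep this transaction open until every table has been copied

    -- each tableSync worker, on its own publisher connection:
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SET TRANSACTION SNAPSHOT '00000003-000001B2-1';  -- must come before any query
    COPY public.t1 TO STDOUT;                 -- fed into the subscriber-side COPY FROM
    COMMIT;

    -- after all tableSync workers have finished:
    COMMIT;  -- in P1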

I think approach 3 might be the best way.

[1] https://www.postgresql.org/message-id/CAA4eK1Ld9-5ueomE_J5CA6LfRo%3DwemdTrUp5qdBhRFwGT%2BdOUw%40mail.gmail.com

Regards
Sachin
