Re: Initial Schema Sync for Logical Replication

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Euler Taveira <euler(at)eulerto(dot)com>, "Kumar, Sachin" <ssetiya(at)amazon(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>
Subject: Re: Initial Schema Sync for Logical Replication
Date: 2023-03-22 13:47:10
Message-ID: CAD21AoBeHERZFdK1mrU5G5doKA3iY8COaLD0ViZJzBYoePSRzg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 22, 2023 at 2:16 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Mar 22, 2023 at 8:29 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Tue, Mar 21, 2023 at 8:18 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Mar 21, 2023 at 7:32 AM Euler Taveira <euler(at)eulerto(dot)com> wrote:
> > >
> > > > You should
> > > > exclude them removing these objects from the TOC before running pg_restore or
> > > > adding a few pg_dump options to exclude these objects. Another issue is related
> > > > to different version. Let's say the publisher has a version ahead of the
> > > > subscriber version, a new table syntax can easily break your logical
> > > > replication setup. IMO pg_dump doesn't seem like a good solution for initial
> > > > synchronization.
> > > >
> > > > Instead, the backend should provide infrastructure to obtain the required DDL
> > > > commands for the specific (set of) tables. This can work around the issues from
> > > > the previous paragraph:
> > > >
> > > ...
> > > > * don't need to worry about different versions.
> > > >
> > >
> > > AFAICU some of the reasons why pg_dump is not allowed to dump from the
> > > newer version are as follows: (a) there could be more columns in the
> > > newer version of the system catalog and then Select * type of stuff
> > > won't work because the client won't have knowledge of additional
> > > columns. (b) the newer version could have new features (represented by
> > > say new columns in existing catalogs or new catalogs) that the older
> > > version of pg_dump has no knowledge of and will fail to get that data
> > > and hence an inconsistent dump. The subscriber will easily be not in
> > > sync due to that.
> > >
> > > Now, how do we avoid these problems even if we have our own version of
> > > functionality similar to pg_dump for selected objects? I guess we will
> > > face similar problems.
> >
> > Right. I think that such functionality needs to return DDL commands
> > that can be executed on the requested version.
> >
> > > If so, we may need to deny schema sync in any such case.
> >
> > Yes. Do we have any concrete use case where the subscriber is an older
> > version, in the first place?
> >
>
> As per my understanding, it is mostly due to the reason that it can
> work today. Today, during an off-list discussion with Jonathan on this
> point, he pointed me to a similar incompatibility in MySQL
> replication. See the "SQL incompatibilities" section in doc[1]. Also,
> please note that this applies not only to initial sync but also to
> schema sync during replication. I don't think it would be feasible to
> keep such cross-version compatibility for DDL replication.

Makes sense to me.

> Having said above, I don't intend that we must use pg_dump from the
> subscriber for the purpose of initial sync. I think the idea at this
> stage is to primarily write a POC patch to see what difficulties we
> may face. The other options that we could try out are (a) try to
> duplicate parts of pg_dump code in some way (by extracting required
> code) for the subscription's initial sync, or (b) have a common code
> (probably as a library or some other way) for the required
> functionality. There could be more possibilities that we may not have
> thought of yet. But the main point is that for approaches other than
> using pg_dump, we should consider ways to avoid duplicity of various
> parts of its code. Due to this, I think before ruling out using
> pg_dump, we should be clear about its risks and limitations.
>
> Thoughts?
>

Agreed. My biggest concern about approaches other than using pg_dump
is the same; the code duplication that could increase the maintenance
costs. We should clarify what points of using pg_dump is not a good
idea, and also analyze alternative ideas in depth.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2023-03-22 13:58:58 Re: CREATE DATABASE ... STRATEGY WAL_LOG issues
Previous Message torikoshia 2023-03-22 13:34:20 Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)