RE: Initial Schema Sync for Logical Replication

From: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Euler Taveira <euler(at)eulerto(dot)com>, "Kumar, Sachin" <ssetiya(at)amazon(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: RE: Initial Schema Sync for Logical Replication
Date: 2023-03-22 07:52:32
Message-ID: OS0PR01MB57165D2784D05E2BCF7B284994869@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wednesday, March 22, 2023 1:16 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Mar 22, 2023 at 8:29 AM Masahiko Sawada
> <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Tue, Mar 21, 2023 at 8:18 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> > >
> > > On Tue, Mar 21, 2023 at 7:32 AM Euler Taveira <euler(at)eulerto(dot)com> wrote:
> > >
> > > > You should
> > > > exclude them removing these objects from the TOC before running
> > > > pg_restore or adding a few pg_dump options to exclude these
> > > > objects. Another issue is related to different version. Let's say
> > > > the publisher has a version ahead of the subscriber version, a new
> > > > table syntax can easily break your logical replication setup. IMO
> > > > pg_dump doesn't seem like a good solution for initial synchronization.
> > > >
> > > > Instead, the backend should provide infrastructure to obtain the
> > > > required DDL commands for the specific (set of) tables. This can
> > > > work around the issues from the previous paragraph:
> > > >
> > > ...
> > > > * don't need to worry about different versions.
> > > >
> > >
> > > AFAICU some of the reasons why pg_dump is not allowed to dump from
> > > the newer version are as follows: (a) there could be more columns in
> > > the newer version of the system catalog and then Select * type of
> > > stuff won't work because the client won't have knowledge of
> > > additional columns. (b) the newer version could have new features
> > > (represented by say new columns in existing catalogs or new
> > > catalogs) that the older version of pg_dump has no knowledge of and
> > > will fail to get that data and hence an inconsistent dump. The
> > > subscriber will easily be not in sync due to that.
> > >
> > > Now, how do we avoid these problems even if we have our own version
> > > of functionality similar to pg_dump for selected objects? I guess we
> > > will face similar problems.
> >
> > Right. I think that such functionality needs to return DDL commands
> > that can be executed on the requested version.
> >
> > > If so, we may need to deny schema sync in any such case.
> >
> > Yes. Do we have any concrete use case where the subscriber is an older
> > version, in the first place?
> >
>
> As per my understanding, it is mostly due to the reason that it can work today.
> Today, during an off-list discussion with Jonathan on this point, he pointed me
> to a similar incompatibility in MySQL replication. See the "SQL
> incompatibilities" section in doc[1]. Also, please note that this applies not only
> to initial sync but also to schema sync during replication. I don't think it would
> be feasible to keep such cross-version compatibility for DDL replication.
>
> Having said above, I don't intend that we must use pg_dump from the
> subscriber for the purpose of initial sync. I think the idea at this stage is to
> primarily write a POC patch to see what difficulties we may face. The other
> options that we could try out are (a) try to duplicate parts of pg_dump code in
> some way (by extracting required
> code) for the subscription's initial sync, or (b) have a common code (probably
> as a library or some other way) for the required functionality. There could be
> more possibilities that we may not have thought of yet. But the main point is
> that for approaches other than using pg_dump, we should consider ways to
> avoid duplicity of various parts of its code. Due to this, I think before ruling out
> using pg_dump, we should be clear about its risks and limitations.

I thought about some possible problems about the design of using pg_dump.

1) According to the design, it will internally call pg_dump when creating
subscription, but it requires to use a powerful user when calling pg_dump.
Currently, it may not be a problem because create subscription also requires
superuser. But people have recently discussed about allowing non-superuser to
create the subscription[1], if that is accepted, then it seems not great to
internally use superuser to call pg_dump while the user creating the
subscription is a non-super user.

2) I think it's possible that some cloud DB service doesn't allow user to use
the client commands(pg_dump ,..) directly, and the user that login in the
database may not have the permission to execute the client commands.

[1] https://www.postgresql.org/message-id/flat/20230308194743.23rmgjgwahh4i4rg%40awork3.anarazel.de

Best Regards,
Hou zj

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message shiy.fnst@fujitsu.com 2023-03-22 08:08:22 RE: Dropped and generated columns might cause wrong data on subs when REPLICA IDENTITY FULL
Previous Message David Rowley 2023-03-22 07:50:28 Re: Comment in preptlist.c