| From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
|---|---|
| To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
| Cc: | Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Support logical replication of DDLs, take2 |
| Date: | 2026-06-11 07:07:26 |
| Message-ID: | CAD21AoCEVO+zpLmQqKwZbJ5+rvqsJ1e0wnTNBH437p8tDw7B=g@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-general pgsql-hackers |
On Tue, May 5, 2026 at 10:36 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Mon, May 4, 2026 at 5:23 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Fri, May 1, 2026 at 2:11 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Apr 29, 2026 at 9:44 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > On Wed, Apr 29, 2026 at 3:19 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Mon, Apr 27, 2026 at 11:32 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > > >
> > > > > > Yes, there will be a maintenance cost of JSON-based deparsing
> > > > > > approach. But note that multiple senior people (Alvaro Herrera, Robert
> > > > > > Haas) [1] seems to favor that approach. So, I am not sure we can
> > > > > > conclude to abandon that approach without those people or some other
> > > > > > senior people agreeing to abandon it. To be clear, I am not against
> > > > > > considering a new/different approach for DDL replication but just that
> > > > > > it is not clear that old/existing approach can be ruled out without
> > > > > > more discussion on it,
> > > > >
> > > > > Thank you for pointing it out. Just to be clear, IIUC what they liked
> > > > > was to use JSON string representation of DDLs, but not JSON string
> > > > > representation of DDLs that are deparsed from parse nodes, no?
> > > > >
> > > >
> > > > As per my understanding, we built deparsing stuff with a goal of
> > > > supporting DDL replication and Alvaro was the original author of that
> > > > work, see [1]. The benefit it provides flexibility in terms of
> > > > filtering by decoding plugin, if any, or changing the DDL (like
> > > > schema-mapping) during apply. It is not clear to me if we can achive
> > > > similar level of flexibility with other approach.
> > >
> > > I think we can generate the same JSON-string representation of DDLs
> > > from catalog information, it would also require a lot of code, though.
> > > It would be independent from parse nodes and if we implement it as an
> > > option for pg_get_xxx_ddl() functionality it would be able to be
> > > reused by other tools too.
> > >
> >
> > IIRC, this was discussed previously as well but we were not sure if we
> > can build all (especially some complex ones) without parsetree. See
> > discussion/emails around [1][2].
>
> Right. We would need parsetree somewhat. I think we're able to
> generate CREATE and DROP TABLE statements for the particular table
> without parsetree. But as for generating CREATE TABLE for the table,
> it's going to be a combination of CREATE/ALTER TABLE/INDEX/SEQUENCE
> statements, like pg_dump does. For instance, if a user executes
> "CREATE TABLE foo (id serial primary key)", we create table, sequence,
> and index, but searching system catalogs doesn't tell us these objects
> are created in one statement. So we would generate multiple DDLs as
> follow:
>
> CREATE TABLE public.foo (id integer NOT NULL);
> CREATE SEQUENCE public.foo_id_seq AS integer ...;
> ALTER SEQUENCE public.foo OWNED BY foo;
> ALTER TABLE public.foo ADD CONSTRAINT foo_pkey PRIMARY KEY (id);
>
> While these queries create the same table as the one on the publisher,
> we need to consider whether it's okay to replicate these queries
> instead of the oen statement originally executed on the publisher. If
> we can use something like pg_get_table_ddl() in DDL replication, that
> function would be able to be used also by the initial schema
> synchronization.
>
> As for ALTER TABLE, we would need parsetree of ALTER TABLE subcommands.
>
> >
> > > >
> > > > >
> > > > > >
> > > > > > We would need to maintain the JSON serialization code whenever
> > > > > > > creating or modifying parse nodes, regardless of whether the changes
> > > > > > > were related to DDL replication. IIUC, this was the primary reason the
> > > > > > > feature didn't cross the finish line.
> > > > > > >
> > > > > > > Additionally, I think there is another design issue: it is not
> > > > > > > output-plugin agnostic. Since the deparsed DDL was written by a
> > > > > > > logical-replication-specific event trigger, third-party logical
> > > > > > > decoding plugins cannot easily detect DDL events.
> > > > > > >
> > > > > >
> > > > > > Why RmgrId like RM_LOGICALDDLMSG_ID and XLOG_LOGICAL_DDL_MESSAGE wal
> > > > > > info is not sufficient for this? Decoder will add a message like
> > > > > > REORDER_BUFFER_CHANGE_DDL which can be used to detect DDL message, no?
> > > > >
> > > > > Right, but I'm not sure this is a good developer experience that
> > > > > additional steps are required to capture DDL events for other plugins
> > > > > while changes of INSERT/UPDATE/DELETE/TRUNCATE are passed from the
> > > > > logical decoding by default.
> > > > >
> > > >
> > > > Yes, there could probably be additional steps for plugins but they
> > > > must be doing a few things already which are defined at publication
> > > > level like column lists, row filtering, something related to RI, etc.
> > >
> > > I think those publication-level features operate at a somewhat
> > > different layer than the fundamental mechanism of capturing DDL
> > > events. Plugins filter rows or columns based on configuration, but the
> > > logical decoding itself guarantees that the DML events are reliably
> > > passed to them. Given that the TRUNCATE in logical replication already
> > > works so, I guess DDL should have the same fundamental guarantee.
> > >
> > > it's unclear to me how plugins could reliably manage these event
> > > triggers. While a plugin might create an event trigger during the
> > > startup callback if it doesn't exist, it cannot drop it during the
> > > shutdown callback. We also cannot establish a dependency between an
> > > event trigger and a logical replication slot. We would likely need to
> > > invent a new plugin callback specifically invoked at slot drop time
> > > just to clean it up. Also, if different plugins want to capture DDL
> > > events, they could end up registering different event triggers,
> > > emitting multiple DDL WAL records for the same DDL event.
> > >
> >
> > Why would different plugins end up registering different event
> > triggers? I mean if they are already registered by the first plugin
> > what is the need to re-register.
> >
>
> Since you mentioned column lists and row filtering as examples of what
> individual plugins already do in a reply to my point that registering
> event triggers could be an additional step for other plugins, I
> thought you meant that each plugin registering event triggers is not a
> huge cumbersome. I think different plugins don't need to register
> different event triggers. We can have the common event triggers to
> write logical-DDL WAL and register them when the first logical slot is
> created. But as I mentioned, we need to be careful about both the
> concurrent slot creation/drop and the fact that slot creation/drop
> operations are not transactional. Also, we cannot create event
> triggers on the replicas even if a logical slot is created there. If a
> failover happens before applying the WAL of creating event triggers,
> we would need to somehow make sure that even triggers are created on
> the new primary if it has logical slots.
>
I've reviewed and researched the last proposed DDL deparse patch[1],
and found that it can generate a set of commands for a single command.
For example, deparsing "create table test_serial (a serial)"
generates:
- CREATE SEQUENCE public.test_serial_a_seq CACHE 1 NO CYCLE INCREMENT
BY 1 MINVALUE 1 MAXVALUE 2147483647 START WITH 1 RESTART 1 AS
pg_catalog.int4;
- CREATE TABLE public.test_serial (a pg_catalog.int4 STORAGE PLAIN NOT
NULL DEFAULT pg_catalog.nextval('public.test_serial_a_seq'::pg_catalog.regclass));
- ALTER SEQUENCE public.test_serial_a_seq OWNED BY public.test_serial.a;
And deparsing "create table test_b (a int, b int references test_a
(a))" generates:
- CREATE TABLE public.test_b (a pg_catalog.int4 STORAGE PLAIN, b
pg_catalog.int4 STORAGE PLAIN);
- ALTER TABLE public.test_b ADD CONSTRAINT test_b_b_fkey FOREIGN KEY
(b) REFERENCES public.test_a(a);
The reason is that the patch deparses after the command runs (via an
event trigger at ddl_command_end) and rebuilds the DDL from the
catalog. By that time, the server has already expanded one CREATE
TABLE into several internal sub-commands: transformCreateStmt() turns
a serial column into a CREATE SEQUENCE plus ALTER SEQUENCE ... OWNED
BY, a foreign key into an ALTER TABLE ADD CONSTRAINT, a primary key
into an index, and so on. So the output looks much like pg_dump or the
proposed pg_get_table_ddl() [2], because it also rebuilds from the
catalog, not from the original parse tree.
For a general DDL deparse feature, is it useful to turn one command
into several, or to emit a command that is different from what the
user ran? Since DDL deparsing can be used also for schema-qualifying
DDL commands, I considered it's useful for audit purposes or CDC use
cases. But if the generated DDL command could be different or even
multiple DDL commands are generated from one DDL command, I'm not sure
it's useful for other use cases than DDL replication.
Even in DDL replication use cases, I'm concerned that it might be
confusing users. For example, what if DDL replication has DDL command
filter and users specify it to replicate only 'CREATE TABLE' and not
for 'ALTER TABLE'? Users might expect all replicated DDL commands are
'CREATE TABLE' but it would not be able to replicate some form of
CREATE TABLE without ALTER TABLE or CREATE/ALTER SEQUENCE.
Also, given that the proposed pg_get_table_ddl()[2] has approximately
1800 lines to support generating CREATE TABLE commands for the given
table, does it really make sense to have additional 3000 lines to
support DDL deparsing that works in mostly the same way but based on
parse trees? While the JSON blob idea is flexible and preferable,
considering that two features seem to work in mostly the same way,
I'm not sure it can justify the implementation costs.
I think that DDL deparse should keep the intention and the form of the
original command as much as possible: do not turn one command into
several commands, and do not add options that the user did not write.
I see two benefits:
First, as a feature on its own, this makes DDL deparse usable for pure
schema-qualification. That is useful outside DDL replication too, for
example for audit or CDC. (Some audit cases may instead want the fully
expanded form. We could offer that as an option, but I think the
default should stay faithful to the original command.)
Second, as a building block for DDL replication, keeping the original
command lets the subscriber benefit from improvements in its own
version. Some commands may take a lighter lock, or use better new
defaults. For example, in v19 the default TOAST compression became
lz4, while it was pglz before. If we had DDL replication in v18 and a
user set up a logical replication v18 -> v19, I think they would not
want new tables on v19 to be forced to pglz just because that was the
publisher's default. If a user does want exactly the same options on
the subscriber, we can add an option to include the options that were
not explicitly specified too. But even then I do not think we should
produce multiple commands from one command.
Another point I think we should discuss is how to capture DDL events.
The last patch creates event triggers automatically at CREATE
PUBLICATION. I am not sure this is the best approach:
Third-party logical decoding output plugins do not use PUBLICATION, so
they cannot easily capture DDL events this way.
We could instead create a common event trigger when the first logical
slot is created. But it is not easy to make sure the trigger is
reliably created and dropped, including on replicas (for example, a
slot can exist only on a replica that is later promoted).
I think we can decouple the DDL capture infrastructure from event
triggers, and use it as common infrastructure for both event triggers
and WAL-logging of DDL for replication. That way, I think we can
trigger DDL deparse while not relying on event triggers and minimizing
the code duplication. I'm drafting the patch for this idea and
feedback is very welcome.
Regards,
[1] https://www.postgresql.org/message-id/OS0PR01MB57163E6487EFF7378CB8E17C9438A%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[2] https://www.postgresql.org/message-id/CANxoLDfjQnhM%3DE6JSyYo9s9OdjqoN8s_3wE5yL%3DkaDu_X8j-dA%40mail.gmail.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Mohamed ALi | 2026-06-11 07:35:14 | Re: [PATCH] vacuumdb: Add --exclude-database option |
| Previous Message | Ewan Young | 2026-06-11 07:03:54 | [PATCH] seg: preserve the upper boundary's certainty indicator in seg_out() |
| From | Date | Subject | |
|---|---|---|---|
| Previous Message | Klaus Darilion | 2026-06-08 07:57:39 | WAL compatibility between Primary and Standby Server, chained clients |