Re: Add support for specifying tables in pg_createsubscriber.

From: "Euler Taveira" <euler(at)eulerto(dot)com>
To: "'Shubham Khanna'" <khannashubham1197(at)gmail(dot)com>, "Peter Smith" <smithpb2250(at)gmail(dot)com>
Cc: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, "vignesh C" <vignesh21(at)gmail(dot)com>, "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>, "PostgreSQL Hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add support for specifying tables in pg_createsubscriber.
Date: 2025-08-22 03:18:37
Message-ID: e1b982b6-f8db-4e2b-9a9a-25f3a2acf0d1@app.fastmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Aug 21, 2025, at 6:08 AM, Shubham Khanna wrote:
> Attachments:
> * v3-0001-Support-tables-via-pg_createsubscriber.patch
> * v3-0002-Support-WHERE-clause-and-COLUMN-list-in-table-arg.patch

+ <term><option>--table=<replaceable class="parameter">table</replaceable></option></term>
+ <listitem>
+ <para>
+ Adds one or more specific tables to the publication for the most recently
+ specified <option>--database</option>. This option can be given multiple
+ times to include additional tables.
+ </para>

I think it is a really bad idea to rely on the order of options to infer which
tables are from which database. The current design about publication and
subscription names imply order but it doesn't mean subscription name must be
the next option after the publication name.

Your explanation from the initial email:

> The attached patch introduces a new '--table' option that can be
> specified after each '--database' argument. It allows users to
> selectively replicate specific tables within a database instead of
> defaulting to all tables. The syntax is like that used in 'vacuumdb'
> and supports multiple '--table' arguments per database, including
> optional column lists and row filters.

vacuumdb doesn't accept multiple --table options if you specify multiple
databases.

$ vacuumdb -a -t public.pgbench_branches -t public.pgbench_tellers -d db1 -d db2
vacuumdb: error: cannot vacuum all databases and a specific one at the same time

However, it accepts multiple --table options if you specify --all options.
(Although, I think it needs better error handling.)

$ vacuumdb -a -t public.pgbench_branches
vacuumdb: vacuuming database "contrib_regression"
vacuumdb: error: query failed: ERROR: relation "public.pgbench_branches" does not exist
LINE 2: VALUES ('public.pgbench_branches'::pg_catalog.regclass, NU...
^
vacuumdb: detail: Query was: WITH listed_objects (object_oid, column_list) AS (
VALUES ('public.pgbench_branches'::pg_catalog.regclass, NULL::pg_catalog.text)
)

Let's recap the initial goal: pg_createsubscriber creates a new logical replica
from a physical standby server. Your proposal is extending the tool to create a
partial logical replica but doesn't mention what you would do with the other
part; that is garbage after the conversion. I'm not convinced that the current
proposal is solid as-is.

+ <para>
+ The argument must be a fully qualified table name in one of the
+ following forms:
+ <itemizedlist><listitem><para><literal>schema.table</literal></para></listitem>
+ <listitem><para><literal>db.schema.table</literal></para></listitem></itemizedlist>
+ If the database name is provided, it must match the most recent
+ <option>--database</option> argument.
+ </para>

Why do you want to include the database if you already specified it?

+ <para>
+ A table specification may also include an optional column list and/or
+ row filter:
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>schema.table(col1, col2, ...)</literal> &mdash; publishes
+ only the specified columns.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>schema.table WHERE (predicate)</literal> &mdash; publishes
+ only rows that satisfy the given condition.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Both forms can be combined, e.g.
+ <literal>schema.table(col1, col2) WHERE (id &gt; 100)</literal>.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>

This is a bad idea for some cases. Let's say your setup involves a filter that
uses only 10% of the rows from a certain table. It is better to do a manual
setup. Besides that, expressions in options can open a can of worms. In the
column list case, there might be corner cases like if you have a constraint in
a certain column and that column was not included in the column list, the setup
will fail; there isn't a cheap way to detect such cases.

It seems this proposal doesn't serve a general purpose. It is copying a *whole*
cluster to use only a subset of tables. Your task with pg_createsubscriber is
more expensive than doing a manual logical replication setup. If you have 500
tables and want to replicate only 400 tables, it doesn't seem productive to
specify 400 -t options. There are some cases like a small set of big tables
that this feature makes sense. However, I'm wondering if a post script should
be used to adjust your setup. There might be cases that involves only dozens of
tables but my experience says it is rare. My expectation is that this feature
is useful for avoiding some specific tables. Hence, the copy of the whole
cluster is worthwhile.

--
Euler Taveira
EDB https://www.enterprisedb.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2025-08-22 03:36:45 Re: Add SPLIT PARTITION/MERGE PARTITIONS commands
Previous Message shveta malik 2025-08-22 03:11:42 Re: POC: enable logical decoding when wal_level = 'replica' without a server restart