RE: Data is copied twice when specifying both child and parent table in publication

From: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: RE: Data is copied twice when specifying both child and parent table in publication
Date: 2021-10-28 07:35:07
Message-ID: OS0PR01MB5716C756312959F293A822C794869@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

As there are basically two separate issues mentioned in the thread, I tried to
summarize the discussion so far which might be helpful to others.

* The first issue[1]:

If we include both the partitioned table and (explicitly) its child partitions
in the publication when set publish_via_partition_root=true, like:
---
CREATE PUBLICATION pub FOR TABLE parent_table, child_table with (publish_via_partition_root=on);
---
It could execute initial sync for both the partitioned(parent_table) table and
(explicitly) its child partitions(child_table) which cause duplication of
data in partition(child_table) in subscriber side.

The reasons I considered this behavior a bug are:

a) In this case, the behavior of initial sync is inconsistent with the behavior
of transaction streaming. All changes in the leaf the partition will be applied
using the identity and schema of the partitioned(root) table. But for the
initial sync, it will execute sync for both the partitioned(root) table and
(explicitly) its child partitions which cause duplication of data.

b) The behavior of FOR TABLE is inconsistent with the behavior of FOR ALL TABLE.
If user create a FOR ALL TABLE publication and set publish_via_partition_root=true,
then only the top most partitioned(root) table will execute initial sync.

IIRC, most people in this thread agreed that the current behavior is not
expected. So, maybe it's time to try to fix it.

Attach my fix patch here. The patch try to fix this issue by making the
pg_publication_tables view only show partitioned table when
publish_via_partition_root is true.

* The second issue[2]:
-----
CREATE TABLE sale (sale_date date not null,country_code text, product_sku text,
units integer) PARTITION BY RANGE (sale_date);
CREATE TABLE sale_201901 PARTITION OF sale FOR VALUES FROM ('2019-01-01') TO
('2019-02-01');
CREATE TABLE sale_201902 PARTITION OF sale FOR VALUES FROM ('2019-02-01') TO
('2019-03-01');

(1) PUB: CREATE PUBLICATION pub FOR TABLE sale_201901,
sale_201902 WITH (publish_via_partition_root=true);
(2) SUB: CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres host=localhost port=5432' PUBLICATION pub;
(3) PUB: INSERT INTO sale VALUES('2019-01-01', 'AU', 'cpu', 5), ('2019-01-02', 'AU', 'disk', 8);
(4) SUB: SELECT * FROM sale;
(5) PUB: ALTER PUBLICATION pub ADD TABLE sale;
(6) SUB: ALTER SUBSCRIPTION sub REFRESH PUBLICATION;
(7) SUB: SELECT * FROM sale;
-----

In step (7), we can see duplication of data.

The reason is that the INSERTed data is first published though the partitions,
since initially there is no partitioned table in the publication (so
publish_via_partition_root=true doesn't have any effect). But then adding the
partitioned table to the publication and refreshing the publication in the
subscriber, the data is then published "using the identity and schema of the
partitioned table" due to publish_via_partition_root=true.
(Copied from Greg's analysis).

Whether this behavior is correct is still under debate.

Overall, I think the second issue still needs further discussion while the
first issue seems clear that most people think it's unexpected. So, I think it
might be better to fix the first issue.

[1] https://www.postgresql.org/message-id/OS0PR01MB57167F45D481F78CDC5986F794B99%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[2] https://www.postgresql.org/message-id/flat/CAJcOf-d8SWk3z3fJaLW9yuVux%3D2ESTsXOSdKzCq1O3AWBpgnMQ%40mail.gmail.com#fc96a42158b5e98ace26d077a6f7eac5

Best regards,
Hou zj

Attachment Content-Type Size
v4-0001-fix-data-double-published.patch application/octet-stream 4.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2021-10-28 07:57:29 Re: inefficient loop in StandbyReleaseLockList()
Previous Message Erik Rijkers 2021-10-28 07:29:10 Re: [PATCH] Proposal for HIDDEN/INVISIBLE column