RE: Data is copied twice when specifying both child and parent table in publication

From: "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>
To: Jacob Champion <jchampion(at)timescale(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, vignesh C <vignesh21(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "Takamichi Osumi (Fujitsu)" <osumi(dot)takamichi(at)fujitsu(dot)com>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>
Subject: RE: Data is copied twice when specifying both child and parent table in publication
Date: 2023-03-28 09:59:49
Message-ID: OS3PR01MB627567D49DE885BCA2B604729E889@OS3PR01MB6275.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tues, Mar 28, 2023 at 7:02 AM Jacob Champion <jchampion(at)timescale(dot)com> wrote:
> On Mon, Mar 20, 2023 at 11:22 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> > If the tests you have in mind are only related to this patch set then
> > feel free to propose them here if you feel the current ones are not
> > sufficient.
>
> I think the new tests added by Wang cover my concerns (thanks!). I share
> Peter's comment that we don't seem to have a regression test covering
> only the bug description itself -- just ones that combine that case with
> row and column restrictions -- but if you're all happy with the existing
> approach then I have nothing much to add there.

The scenario of this bug is to subscribe to two publications at the same time,
and these two publications publish parent table and child table respectively.
And option via_root is specified in both publications or only in the publication
of the parent table. At this time, the data on the publisher-side will be copied
twice (the data will be copied to the two tables on the subscribe-side
respectively).
So, I think we have covered this bug itself in 013_partition.pl. We inserted the
initial data into the parent table tab4 on the publisher-side, and checked
whether the sync is completed as we expected (there is data in table tab4, but
there is no data in table tab4_1).

> I was staring at this subquery in fetch_table_list():
>
> > + " ( SELECT array_agg(a.attname ORDER BY a.attnum)\n"
> > + " FROM pg_attribute a\n"
> > + " WHERE a.attrelid = gpt.relid AND\n"
> > + " a.attnum = ANY(gpt.attrs)\n"
> > + " ) AS attnames\n"
>
> On my machine this takes up roughly 90% of the runtime of the query,
> which makes for a noticeable delay with a bigger test case (a couple of
> FOR ALL TABLES subscriptions on the regression database). And it seems
> like we immediately throw all that work away: if I understand correctly,
> we only use the third column for its interaction with DISTINCT. Would it
> be enough to just replace that whole thing with gpt.attrs?

Make sense.
Changed as suggested.

Attach the new patch.

Regards,
Wang Wei

Attachment Content-Type Size
v25-0001-Avoid-syncing-data-twice-for-the-publish_via_par.patch application/octet-stream 27.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2023-03-28 10:07:52 Re: Add standard collation UNICODE
Previous Message Dave Page 2023-03-28 09:50:07 Re: Remove 'htmlhelp' documentat format (was meson documentation build open issues)