Re: Allow logical replication to copy tables in binary format

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>
Cc: "Takamichi Osumi (Fujitsu)" <osumi(dot)takamichi(at)fujitsu(dot)com>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>, Euler Taveira <euler(at)eulerto(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Allow logical replication to copy tables in binary format
Date: 2023-01-30 12:34:20
Message-ID: CALj2ACW5Oa7_v25iZb326UXvtM_tjQfw0Tc3hPJ8zN4FZqc9cw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jan 30, 2023 at 4:19 PM Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com> wrote:
>

Thanks for providing an updated patch.

>> On Thu, Jan 12, 2023 at 1:53 PM Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com> wrote:
>> 1. The performance numbers posted upthread [1] look impressive for the
>> use-case tried, that's a 2.25X improvement or 55.6% reduction in
>> execution times. However, it'll be great to run a few more varied
>> tests to confirm the benefit.
>
> Sure, do you have any specific test case or suggestion in mind?

Have a huge amount of publisher's table (with mix of columns like int,
text, double, bytea and so on) prior data so that the subscriber's
table sync workers have to do a "good" amount of work to copy, then
measure the copy_table time with and without patch.

>> 2. It'll be great to capture the perf report to see if the time spent
>> in copy_table() is reduced with the patch.
>
> Will provide that in another email soon.

Thanks.

>> 4. Why to add tests to existing 002_types.pl? Can we add a new file
>> with all the data types covered?
>
> Since 002_types.pl is where the replication of data types are covered. I thought it would be okay to test replication with the binary option in that file.
> Sure, we can add a new file with different data types for testing subscriptions with binary option. But then I feel like it would have too many duplicated lines with 002_types.pl.
> If you think that 002_types.pl lacks some data types needs to be tested, then we should add those into 002_types.pl too whether we test subscription with binary option in that file or some other place, right?
>
> Previously, it wasn't actually testing the initial table sync since all tables were empty when subscription was created.
> I just simply split the data initially inserted to test initial table sync.
>
> With this patch, it inserts the first two rows for all data types before subscriptions get created.
> You can see these lines:

It'd be better and clearer to have a separate TAP test file IMO since
the focus of the feature here isn't to just test for data types. With
separate tests, you can verify "ERROR: incorrect binary data format
in logical replication column 1" cases.

>> 8. Note that the COPY binary format isn't portable across platforms
>> (Windows to Linux for instance) or major versions
>> https://www.postgresql.org/docs/devel/sql-copy.html, whereas, logical
>> replication is https://www.postgresql.org/docs/devel/logical-replication.html.
>> I don't see any handling of such cases in copy_table but only a check
>> for the publisher version. I think we need to account for all the
>> cases - allow binary COPY only when publisher and subscriber are of
>> same versions, architectures, platforms. The trick here is how we
>> identify if the subscriber is of the same type and taste
>> (architectures and platforms) as the publisher. Looking for
>> PG_VERSION_STR of publisher and subscriber might be naive, but I'm not
>> sure if there's a better way to do it.
>
> I think having the "copy_format" option with text as default, like I replied to your 3rd review above, will keep things as they are now.
> The patch now will only allow users to choose binary copy as well, if they want it and acknowledge the limitations that come with binary copy.
> COPY command's portability issues shouldn't be an issue right now, since the patch still supports text format. Right?

With the above said, do you think checking for publisher versions is
needed? The user can decide to enable/disable binary COPY based on the
publisher's version no?
+ /* If the publisher is v16 or later, specify the format to copy data. */
+ if (walrcv_server_version(LogRepWorkerWalRcvConn) >= 160000)
+ {

Few more comments on v7:
1.
+ Specifies the format in which pre-existing data on the publisher will
+ copied to the subscriber. Supported formats are
+ <literal>text</literal> and <literal>binary</literal>. The default is
+ <literal>text</literal>.
It'll be good to call out the cases in the documentation as to where
copy_format can be enabled and needs to be disabled.

2.
+ errmsg("%s value should be either \"text\" or \"binary\"",
How about "value must be either ...."?

3.
+ if (!opts->binary &&
+ opts->copy_format == LOGICALREP_COPY_AS_BINARY)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("%s and %s are mutually exclusive options",
+ "binary = false", "copy_format = binary")));

+ "CREATE SUBSCRIPTION tap_sub_binary CONNECTION
'$publisher_connstr' PUBLICATION tap_pub WITH (slot_name =
tap_sub_binary_slot, binary = true, copy_format = 'binary')"
Why should the subscription's binary option and copy_format option be
tied at all? Tying these two options hurts usability. Is there a
fundamental reason? I think they both are/must be independent. One
deals with data types and another deals with how initial table data is
copied.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2023-01-30 12:41:20 Re: Assertion failure in SnapBuildInitialSnapshot()
Previous Message Przemysław Sztoch 2023-01-30 12:21:01 Re: generate_series for timestamptz and time zone problem