Re: speed up a logical replica setup

From: "Euler Taveira" <euler(at)eulerto(dot)com>
To: "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>, "vignesh C" <vignesh21(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Michael Paquier" <michael(at)paquier(dot)xyz>, "Peter Eisentraut" <peter(at)eisentraut(dot)org>, "Andres Freund" <andres(at)anarazel(dot)de>, "Ashutosh Bapat" <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, "Amit Kapila" <amit(dot)kapila16(at)gmail(dot)com>, "Shlok Kyal" <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com>
Subject: Re: speed up a logical replica setup
Date: 2024-03-01 20:48:20
Message-ID: e390e35e-508e-4eb8-92e4-e6b066407a41@app.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 22, 2024, at 12:45 PM, Hayato Kuroda (Fujitsu) wrote:
> Based on idea from Euler, I roughly implemented. Thought?
>
> 0001-0013 were not changed from the previous version.
>
> V24-0014: addressed your comment in the replied e-mail.
> V24-0015: Add disconnect_database() again, per [3]
> V24-0016: addressed your comment in [4].
> V24-0017: addressed your comment in [5].
> V24-0018: addressed your comment in [6].

Thanks for your review. I'm attaching v25 that hopefully addresses all pending
points.

Regarding your comments [1] on v21, I included changes for almost all items
except 2, 20, 23, 24, and 25. It still think item 2 is not required because
pg_ctl will provide a suitable message. I decided not to rearrange the block of
SQL commands (item 20) mainly because it would avoid these objects on node_f.
Do we really need command_checks_all? Depending on the output it uses
additional cycles than command_ok.

In summary:

v24-0002: documentation is updated. I didn't apply this patch as-is. Instead, I
checked what you wrote and fix some gaps in what I've been written.
v24-0003: as I said I don't think we need to add it, however, I won't fight
against it if people want to add this check.
v24-0004: I spent some time on it. This patch is not completed. I cleaned it up
and include the start_standby_server code. It starts the server using the
specified socket directory, port and username, hence, preventing external client
connections during the execution.
v24-0005: partially applied
v24-0006: applied with cosmetic change
v24-0007: applied with cosmetic change
v24-0008: applied
v24-0009: applied with cosmetic change
v24-0010: not applied. Base on #15, I refactored this code a bit. pg_fatal is
not used when there is a database connection open. Instead, pg_log_error()
followed by disconnect_database(). In cases that it should exit immediately,
disconnect_database() has a new parameter (exit_on_error) that controls if it
needs to call exit(1). I go ahead and did the same for connect_database().
v24-0011: partially applied. I included some of the suggestions (18, 19, and 21).
v24-0012: not applied. Under reflection, after working on v24-0004, the target
server will start with new parameters (that only accepts local connections),
hence, during the execution is not possible anymore to detect if the target
server is a primary for another server. I added a sentence for it in the
documentation (Warning section).
v24-0013: good catch. Applied.
v24-0014: partially applied. After some experiments I decided to use a small
number of attempts. The current code didn't reset the counter if the connection
is reestablished. I included the documentation suggestion. I didn't include the
IF EXISTS in the DROP PUBLICATION because it doesn't solve the issue. Instead,
I refactored the drop_publication() to not try again if the DROP PUBLICATION
failed.
v24-0015: not applied. I refactored the exit code to do the right thing: print
error message, disconnect database (if applicable) and exit.
v24-0016: not applied. But checked if the information was presented in the
documentation; it is.
v24-0017: good catch. Applied.
v24-0018: not applied.

I removed almost all boolean return and include the error logic inside the
function. It reads better. I also changed the connect|disconnect_database
functions to include the error logic inside it. There is a new parameter
on_error_exit for it. I removed the action parameter from pg_ctl_status() -- I
think someone suggested it -- and the error message was moved to outside the
function. I improved the cleanup routine. It provides useful information if it
cannot remove the object (publication or replication slot) from the primary.

Since I applied v24-0004, I realized that extra start / stop service are
required. It mean pg_createsubscriber doesn't start the transformation with the
current standby settings. Instead, it stops the standby if it is running and
start it with the provided command-line options (socket, port,
listen_addresses). It has a few drawbacks:
* See v34-0012. It cannot detect if the target server is a primary for another
server. It is documented.
* I also removed the check for standby is running. If the standby was stopped a
long time ago, it will take some time to reach the start point.
* Dry run mode has to start / stop the service to work correctly. Is it an
issue?

However, I decided to include --retain option, I'm thinking about to remove it.
If the logging is enabled, the information during the pg_createsubscriber will
be available. The client log can be redirected to a file for future inspection.

Comments?

[1] https://www.postgresql.org/message-id/TYCPR01MB12077756323B79042F29DDAEDF54C2%40TYCPR01MB12077.jpnprd01.prod.outlook.com

--
Euler Taveira
EDB https://www.enterprisedb.com/

Attachment Content-Type Size
v25-0001-pg_createsubscriber-creates-a-new-logical-replic.patch text/x-patch 92.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2024-03-01 21:03:14 Re: ALTER TABLE SET ACCESS METHOD on partitioned tables
Previous Message Paul Jungwirth 2024-03-01 20:38:27 Re: SQL:2011 application time