Re: speed up a logical replica setup

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Euler Taveira <euler(at)eulerto(dot)com>
Cc: "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com>
Subject: Re: speed up a logical replica setup
Date: 2024-03-06 11:24:06
Message-ID: CALDaNm215LodC48p7LmfAsuCq9m33CWtcag2+9DiyNfWGuL_KQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, 2 Mar 2024 at 02:19, Euler Taveira <euler(at)eulerto(dot)com> wrote:
>
> On Thu, Feb 22, 2024, at 12:45 PM, Hayato Kuroda (Fujitsu) wrote:
>
> Based on idea from Euler, I roughly implemented. Thought?
>
> 0001-0013 were not changed from the previous version.
>
> V24-0014: addressed your comment in the replied e-mail.
> V24-0015: Add disconnect_database() again, per [3]
> V24-0016: addressed your comment in [4].
> V24-0017: addressed your comment in [5].
> V24-0018: addressed your comment in [6].
>
>
> Thanks for your review. I'm attaching v25 that hopefully addresses all pending
> points.
>
> Regarding your comments [1] on v21, I included changes for almost all items
> except 2, 20, 23, 24, and 25. It still think item 2 is not required because
> pg_ctl will provide a suitable message. I decided not to rearrange the block of
> SQL commands (item 20) mainly because it would avoid these objects on node_f.
> Do we really need command_checks_all? Depending on the output it uses
> additional cycles than command_ok.
>
> In summary:
>
> v24-0002: documentation is updated. I didn't apply this patch as-is. Instead, I
> checked what you wrote and fix some gaps in what I've been written.
> v24-0003: as I said I don't think we need to add it, however, I won't fight
> against it if people want to add this check.
> v24-0004: I spent some time on it. This patch is not completed. I cleaned it up
> and include the start_standby_server code. It starts the server using the
> specified socket directory, port and username, hence, preventing external client
> connections during the execution.
> v24-0005: partially applied
> v24-0006: applied with cosmetic change
> v24-0007: applied with cosmetic change
> v24-0008: applied
> v24-0009: applied with cosmetic change
> v24-0010: not applied. Base on #15, I refactored this code a bit. pg_fatal is
> not used when there is a database connection open. Instead, pg_log_error()
> followed by disconnect_database(). In cases that it should exit immediately,
> disconnect_database() has a new parameter (exit_on_error) that controls if it
> needs to call exit(1). I go ahead and did the same for connect_database().
> v24-0011: partially applied. I included some of the suggestions (18, 19, and 21).
> v24-0012: not applied. Under reflection, after working on v24-0004, the target
> server will start with new parameters (that only accepts local connections),
> hence, during the execution is not possible anymore to detect if the target
> server is a primary for another server. I added a sentence for it in the
> documentation (Warning section).
> v24-0013: good catch. Applied.
> v24-0014: partially applied. After some experiments I decided to use a small
> number of attempts. The current code didn't reset the counter if the connection
> is reestablished. I included the documentation suggestion. I didn't include the
> IF EXISTS in the DROP PUBLICATION because it doesn't solve the issue. Instead,
> I refactored the drop_publication() to not try again if the DROP PUBLICATION
> failed.
> v24-0015: not applied. I refactored the exit code to do the right thing: print
> error message, disconnect database (if applicable) and exit.
> v24-0016: not applied. But checked if the information was presented in the
> documentation; it is.
> v24-0017: good catch. Applied.
> v24-0018: not applied.
>
> I removed almost all boolean return and include the error logic inside the
> function. It reads better. I also changed the connect|disconnect_database
> functions to include the error logic inside it. There is a new parameter
> on_error_exit for it. I removed the action parameter from pg_ctl_status() -- I
> think someone suggested it -- and the error message was moved to outside the
> function. I improved the cleanup routine. It provides useful information if it
> cannot remove the object (publication or replication slot) from the primary.
>
> Since I applied v24-0004, I realized that extra start / stop service are
> required. It mean pg_createsubscriber doesn't start the transformation with the
> current standby settings. Instead, it stops the standby if it is running and
> start it with the provided command-line options (socket, port,
> listen_addresses). It has a few drawbacks:
> * See v34-0012. It cannot detect if the target server is a primary for another
> server. It is documented.
> * I also removed the check for standby is running. If the standby was stopped a
> long time ago, it will take some time to reach the start point.
> * Dry run mode has to start / stop the service to work correctly. Is it an
> issue?
>
> However, I decided to include --retain option, I'm thinking about to remove it.
> If the logging is enabled, the information during the pg_createsubscriber will
> be available. The client log can be redirected to a file for future inspection.
>
> Comments?

Few comments:
1) Can we use strdup here instead of atoi, as we do similarly in
case of pg_dump too, else we will do double conversion, convert using
atoi and again to string while forming the connection string:
+ case 'p':
+ if ((opt.sub_port = atoi(optarg)) <= 0)
+ pg_fatal("invalid subscriber
port number");
+ break;

2) We can have some valid range for this, else we will end up in some
unexpected values when a higher number is specified:
+ case 't':
+ opt.recovery_timeout = atoi(optarg);
+ break;

3) Now that we have addressed most of the items, can we handle this TODO:
+ /*
+ * TODO use primary_conninfo (if available) from subscriber and
+ * extract publisher connection string. Assume that there are
+ * identical entries for physical and logical
replication. If there is
+ * not, we would fail anyway.
+ */
+ pg_log_error("no publisher connection string specified");
+ pg_log_error_hint("Try \"%s --help\" for more
information.", progname);
+ exit(1);

4) By default the log level as info here, I was not sure how to set
it to debug level to get these error messages:
+ pg_log_debug("publisher(%d): connection string: %s",
i, dbinfo[i].pubconninfo);
+ pg_log_debug("subscriber(%d): connection string: %s",
i, dbinfo[i].subconninfo);

5) Currently in non verbose mode there are no messages printed on
console, we could have a few of them printed irrespective of verbose
or not like the following:
a) creating publication
b) creating replication slot
c) waiting for the target server to reach the consistent state
d) If pg_createsubscriber fails after this point, you must recreate
the physical replica before continuing.
e) creating subscription

6) The message should be "waiting for the target server to reach the
consistent state":
+#define NUM_CONN_ATTEMPTS 5
+
+ pg_log_info("waiting the target server to reach the consistent state");
+
+ conn = connect_database(conninfo, true);

Regards,
Vignesh

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2024-03-06 11:25:21 Re: [PoC] Improve dead tuple storage for lazy vacuum
Previous Message Michael Paquier 2024-03-06 11:21:14 Re: Fix race condition in InvalidatePossiblyObsoleteSlot()