Re: speed up a logical replica setup

From: "Euler Taveira" <euler(at)eulerto(dot)com>
To: "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: "vignesh C" <vignesh21(at)gmail(dot)com>, "Michael Paquier" <michael(at)paquier(dot)xyz>, "Peter Eisentraut" <peter(at)eisentraut(dot)org>, "Andres Freund" <andres(at)anarazel(dot)de>, "Ashutosh Bapat" <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, "Amit Kapila" <amit(dot)kapila16(at)gmail(dot)com>, "Shlok Kyal" <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
Subject: Re: speed up a logical replica setup
Date: 2024-02-01 00:04:55
Message-ID: 80ce3f65-7ca3-471e-b1a4-24ac529ff4ea@app.fastmail.com
Lists: pgsql-hackers

On Tue, Jan 30, 2024, at 6:26 AM, Hayato Kuroda (Fujitsu) wrote:
> > One open item that is worrying me is how to handle the pg_ctl timeout. This
> > patch does nothing, and the user has to set the PGCTLTIMEOUT environment variable
> > to avoid the execution being canceled after 60 seconds (the default for pg_ctl).
> > Even if you set a high value, it might not be enough for cases like a
> > time-delayed replica. Maybe pg_ctl should accept "no timeout" as a --timeout
> > option. I'll include this caveat in the documentation, but I'm afraid it is
> > not sufficient and we should provide a better way to handle this situation.
>
> I think you might be a bit confused. Even if recovery_min_apply_delay is set,
> e.g., to 10 hours, pg_ctl can still start and stop the server. This is because the
> walreceiver serializes changes as soon as they are received; the delay is applied by the
> startup process. There is no unflushed data, so the server instance can be shut down.
> If you meant the combination of recovery-timeout and a time-delayed replica, yes,
> that is likely to occur. But in that case, using something like a --no-timeout option is
> dangerous. I think we should overwrite recovery_min_apply_delay to zero. Thoughts?

I didn't provide the whole explanation. I'm envisioning the use case in which the
server doesn't reach a consistent state and the pg_ctl timeout is reached (the
consequence is that pg_createsubscriber aborts the execution). It might occur on
a busy server. The probability that it occurs with the current code is low (the
LSN gap for recovery is small). Maybe I'm anticipating issues for when base
backup support is added, but it is better to raise concerns during development.
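In the meantime, the mitigation is the environment variable. A sketch of what the invocation could look like (the pg_createsubscriber options follow the current patch and may change):

```shell
# pg_ctl waits up to PGCTLTIMEOUT seconds (default: 60) for the server to
# start or stop; raise it before running pg_createsubscriber on a busy server.
export PGCTLTIMEOUT=3600
# Hypothetical invocation; paths and flags are illustrative:
# pg_createsubscriber -D /path/to/standby -P "host=pub dbname=app" -d app
echo "${PGCTLTIMEOUT:-60}"
```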

> Below part contains my comments for v11-0001. Note that the ordering is random.

Hayato, thanks for reviewing v11.

> 01. doc
> ```
> <group choice="req">
> <arg choice="plain"><option>-D</option> </arg>
> <arg choice="plain"><option>--pgdata</option></arg>
> </group>
> ```
>
> According to other documentation, like pg_upgrade's, we do not write both the long
> and short options in the synopsis section.

pg_upgrade doesn't, but others do, like pg_rewind, pg_resetwal, pg_controldata,
and pg_checksums. It seems newer tools tend to provide both short and long options.

> 02. doc
> ```
> <para>
> <application>pg_createsubscriber</application> takes the publisher and subscriber
> connection strings, a cluster directory from a physical replica and a list of
> database names and it sets up a new logical replica using the physical
> recovery process.
> </para>
>
> ```
>
> I found that you did not include my suggestion [1] without comment. Do you dislike
> the comment, or are you still considering it?

Documentation is on my list. I didn't fix the documentation since some design
decisions were changed. I'm still working on it.

> 03. doc
> ```
> <term><option>-P <replaceable class="parameter">connstr</replaceable></option></term>
> ```
>
> Too many blanks after -P.

Fixed.

[documentation-related items will be addressed later...]

>
> 07. general
> I think there are some commenting conventions in PG, but this file breaks them.

It is on my list.

> 08. general
> Some pg_log_error() + exit(1) calls can be replaced with pg_fatal().

Done. I kept a few pg_log_error() + exit() calls because there is no
pg_fatal_and_hint() function.

>
> 09. LogicalRepInfo
> ```
> char *subconninfo; /* subscription connection string for logical
> * replication */
> ```
>
> As I posted in comment #8 [2], I don't think it is a "subscription connection". Also,
> "for logical replication" is a bit misleading because it would not be passed to
> workers.

Done.

s/publication/publisher/
s/subscription/subscriber/

> 10. get_base_conninfo
> ```
> static char *
> get_base_conninfo(char *conninfo, char *dbname, const char *noderole)
> ...
> /*
> * If --database option is not provided, try to obtain the dbname from
> * the publisher conninfo. If dbname parameter is not available, error
> * out.
> */
>
> ```
>
> I'm not sure getting the dbname from the conninfo improves the user experience. I felt
> it may trigger unintended targeting.
> (I still think the publisher-server should be removed)

Why not? A single database is a common setup. It is only unintended if you don't
document it accordingly. I'll make sure it is advertised in the --database and
--publisher-server options.

> 11. check_data_directory
> ```
> /*
> * Is it a cluster directory? These are preliminary checks. It is far from
> * making an accurate check. If it is not a clone from the publisher, it will
> * eventually fail in a future step.
> */
> static bool
> check_data_directory(const char *datadir)
> ```
>
> We should also check whether pg_createsubscriber can create a file and a directory.
> GetDataDirectoryCreatePerm() verifies it.

Good point. It is included in the next patch.

> 12. main
> ```
> /* Register a function to clean up objects in case of failure. */
> atexit(cleanup_objects_atexit);
> ```
>
> According to the manpage, callback functions would not be called when it exits
> due to signals:
>
> > Functions registered using atexit() (and on_exit(3)) are not called if a
> > process terminates abnormally because of the delivery of a signal.
>
> Do you have a good way to handle the case? One solution is to output created
> objects in any log level, but the consideration may be too much. Thought?

Nothing? If you interrupt the execution, there will be objects left behind, and
you, as the one who decided to do it, have to clean things up. What do you
expect this tool to do? The documentation will provide some guidance informing
the object name patterns this tool uses, and you can check for leftover objects.
Someone can argue that it is a valid feature request, but IMO it is not at the
top of the list.

> 13, main
> ```
> /*
> * Create a temporary logical replication slot to get a consistent LSN.
> ```
>
> Just to clarify - I still think the process exits before here in case of dry run.
> In case of pg_resetwal, the process exits before doing actual works like
> RewriteControlFile().

Why? Are you suggesting that the dry run mode covers just the verification
part? If so, it is not a dry run mode. I would expect it to run until the end
(or until it accomplishes its goal) but *not* modify data. For pg_resetwal,
the modification is one of the last steps, and the other ones that are skipped
(the KillFoo functions) also modify data. It ends the dry run mode when it
accomplishes its goal (obtaining the new control data values). If we stop
earlier, some of the additional steps won't be covered by the dry run mode, and
a failure could happen that would have been detected by running a few more steps.

> 14. main
> ```
> * XXX we might not fail here. Instead, we provide a warning so the user
> * eventually drops this replication slot later.
> ```
>
> But there are possibilities to exit(1) in drop_replication_slot(). Is it acceptable?

No, there isn't.

> 15. wait_for_end_recovery
> ```
> /*
> * Bail out after recovery_timeout seconds if this option is set.
> */
> if (recovery_timeout > 0 && timer >= recovery_timeout)
> ```
>
> Hmm, IIUC, it should be enabled by default [3]. Do you have anything in your mind?

Why? See [1]. I prefer the kind mode (always wait until recovery ends), but
you and Amit are proposing a more aggressive mode. The proposal (-t 60) seems
OK right now; however, if the goal is to provide base backup support in the
future, you will certainly have to add --recovery-timeout on big clusters or
those with high workloads, because the base backup runs between replication slot
creation and the consistent LSN. Of course, we can change the default when base
backup support is added.

> 16. main
> ```
> /*
> * Create the subscription for each database on subscriber. It does not
> * enable it immediately because it needs to adjust the logical
> * replication start point to the LSN reported by consistent_lsn (see
> * set_replication_progress). It also cleans up publications created by
> * this tool and replication to the standby.
> */
> if (!setup_subscriber(dbinfo, consistent_lsn))
> ```
>
> Subscriptions would be created and the replication origin would be moved forward here,
> but the latter can be done only by a superuser. I felt that this should be
> checked in check_subscriber().

Good point. I included a check for pg_create_subscription role and CREATE
privilege on the specified database.

> 17. main
> ```
> /*
> * Change system identifier.
> */
> modify_sysid(pg_resetwal_path, subscriber_dir);
> ```
>
> Even if I execute it without the -v option, output from the pg_resetwal command
> appears. It seems a bit strange.

pg_resetwal uses a plain printf, and there is no prefix identifying that the
message comes from pg_resetwal. That message has been bothering me for a while,
so let's send it to /dev/null. I'll include it in the next patch.

RewriteControlFile();
KillExistingXLOG();
KillExistingArchiveStatus();
KillExistingWALSummaries();
WriteEmptyXLOG();

printf(_("Write-ahead log reset\n"));
return 0;

[1] https://www.postgresql.org/message-id/b315c7da-7ab1-4014-a2a9-8ab6ae26017c%40app.fastmail.com

--
Euler Taveira
EDB https://www.enterprisedb.com/
