Re: Improve the connection failure error messages

From: Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Aleksander Alekseev <aleksander(at)timescale(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Improve the connection failure error messages
Date: 2024-01-17 08:14:53
Message-ID: CABdArM6ORu+KpS_kXd-jwwPBqYPo1YqZjwwGnqmVanWgdHCggA@mail.gmail.com
Lists: pgsql-hackers

>
> ~~
>
> BTW, while experimenting with the bad connection ALTER I also tried
> setting 'disable_on_error' like below:
>
> ALTER SUBSCRIPTION sub4 SET (disable_on_error);
> ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1';
>
> ...but here the subscription did not become DISABLED as I expected it
> would do on the next connection error iteration. It remains enabled
> and just continues to loop relaunch/ERROR indefinitely same as before.
>
> That looks like it may be a bug. Thoughts?
>
Ideally, if an already-running apply worker hits an exception/error
inside LogicalRepApplyLoop(), the error is caught and the subscription
is disabled when 'disable_on_error' is set -

start_apply(XLogRecPtr origin_startpos)
{
    PG_TRY();
    {
        LogicalRepApplyLoop(origin_startpos);
    }
    PG_CATCH();
    {
        if (MySubscription->disableonerr)
            DisableSubscriptionAndExit();
        ...

What happens in this case is that control reaches
maybe_reread_subscription() through the following call chain -
run_apply_worker() -> start_apply() -> LogicalRepApplyLoop() ->
maybe_reread_subscription()
...
    /*
     * Exit if any parameter that affects the remote connection was changed.
     * The launcher will start a new worker but note that the parallel apply
     * worker won't restart if the streaming option's value is changed from
     * 'parallel' to any other value or the server decides not to stream the
     * in-progress transaction.
     */
    if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 ||
    ...

maybe_reread_subscription() sees the changed conninfo and calls
apply_worker_exit(). This exits the current process without throwing an
exception to the caller, and the postmaster will try to restart the
apply worker. Before it reaches start_apply() [where the exception is
handled], the new apply worker hits the code that establishes the
connection to the publisher -

ApplyWorkerMain() -> run_apply_worker() -
...
    LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo,
                                            true /* replication */ ,
                                            true,
                                            must_use_password,
                                            MySubscription->name, &err);

    if (LogRepWorkerWalRcvConn == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_CONNECTION_FAILURE),
                 errmsg("could not connect to the publisher: %s", err)));
...
Because of the bad connection string in the subscription, this
connection attempt errors out:
[28680] ERROR: could not connect to the publisher: invalid port number: "-1"
[3196] LOG: background worker "logical replication apply worker" (PID 28680) exited with exit code 1

Now, the postmaster keeps trying to restart the apply worker and it
will keep failing until the connection string is corrected or the
subscription is disabled manually.

I think this is a bug that needs to be handled in run_apply_worker()
when disable_on_error is set.
IMO, this bug-fix discussion deserves a separate thread. Thoughts?
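
To make the control flow above easier to follow, here is a small
standalone toy model in plain C. It is only a sketch and not PostgreSQL
code: setjmp/longjmp stand in for PG_TRY/PG_CATCH and ereport(ERROR),
and every name in it (ToySubscription, toy_worker_main, toy_launcher,
the guard_connect flag) is made up for illustration. With
guard_connect = false the connection attempt runs before the try block,
as in the current code path, so the error never reaches the
disable_on_error handling. With guard_connect = true the connection
attempt runs inside the try block, which is just one conceivable shape
of a fix and not necessarily what a patch would do.

```c
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins only; none of these names are PostgreSQL APIs. */
typedef enum { WORKER_RUNNING, WORKER_EXIT_ERROR, WORKER_DISABLED_SUB } WorkerResult;

typedef struct
{
    const char *conninfo;
    bool        disable_on_error;
    bool        enabled;
} ToySubscription;

/* Innermost active error handler, loosely like PG_exception_stack. */
static jmp_buf *current_handler = NULL;

/* Loosely mimics ereport(ERROR): jump to the innermost handler. */
static void
toy_error(void)
{
    longjmp(*current_handler, 1);
}

/* Assumption for the demo: "port = -1" is the only bad connection string. */
static void
toy_connect_or_error(const ToySubscription *sub)
{
    if (strcmp(sub->conninfo, "port = -1") == 0)
        toy_error();
}

/* One worker lifetime; guard_connect selects where the connect happens. */
static WorkerResult
toy_worker_main(ToySubscription *sub, bool guard_connect)
{
    jmp_buf top_level;
    jmp_buf try_frame;

    /* Top-level handler: an uncaught error just ends the worker. */
    current_handler = &top_level;
    if (setjmp(top_level) != 0)
    {
        current_handler = NULL;
        return WORKER_EXIT_ERROR;
    }

    /* Unguarded connect: an error here bypasses the handler below. */
    if (!guard_connect)
        toy_connect_or_error(sub);

    /* Equivalent of the PG_TRY/PG_CATCH block in start_apply(). */
    current_handler = &try_frame;
    if (setjmp(try_frame) == 0)
    {
        if (guard_connect)
            toy_connect_or_error(sub);
        /* The apply loop would run here. */
        current_handler = NULL;
        return WORKER_RUNNING;
    }

    /* Catch block: honor disable_on_error. */
    current_handler = NULL;
    if (sub->disable_on_error)
    {
        sub->enabled = false;
        return WORKER_DISABLED_SUB;
    }
    return WORKER_EXIT_ERROR;
}

/* Relaunch the worker while the subscription is enabled; return launch count. */
static int
toy_launcher(ToySubscription *sub, bool guard_connect, int max_attempts)
{
    int launches = 0;

    while (sub->enabled && launches < max_attempts)
    {
        launches++;
        if (toy_worker_main(sub, guard_connect) == WORKER_RUNNING)
            break;
    }
    return launches;
}
```

Calling toy_launcher() on a subscription with conninfo "port = -1" and
disable_on_error set uses up every allowed attempt when guard_connect
is false and leaves the subscription enabled, which matches the endless
relaunch/ERROR loop reported above. With guard_connect set to true, the
first failure disables the subscription and the loop stops after one
launch.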
