Re: Optionally automatically disable logical replication subscriptions on error

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Smith, Peter" <peters(at)fast(dot)au(dot)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Optionally automatically disable logical replication subscriptions on error
Date: 2021-06-22 02:49:56
Message-ID: 0646E5E0-EDF2-4D74-89DD-1C9BF0D7EA67@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Jun 21, 2021, at 5:57 PM, Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> * Is the goal to prevent some *unattended* SUBSCRIPTION from going bad
> at some point in future and then going into a relaunch loop for
> days/weeks and causing 1000's of errors without the user noticing. In
> that case, this patch seems to be quite useful, but for this goal
> maybe you don't want to be checking the tablesync workers at all, but
> should only be checking the apply worker like your original v1 patch
> did.

Yeah, my motivation was preventing an infinite loop, and providing a clean way for the users to know that replication they are waiting for won't ever complete, rather than having to infer that it will never halt.

> * Is the goal just to be a convenient way to disable the subscription
> during the CREATE SUBSCRIPTION phase so that the user can make
> corrections in peace without the workers re-launching and making more
> error logs?

No. This is not and never was my motivation. It's an interesting question, but that idea never crossed my mind. I'm not sure what changes somebody would want to make *after* creating the subscription. Certainly, there may be problems with how they have things set up, but they won't know that until the first error happens.

> Here the patch is helpful, but only for simple scenarios
> like 1 faulty table. Imagine if there are 10 tables (all with PK
> violations at DATASYNC copy) then you will encounter them one at a
> time and have to re-enable the subscription 10 times, after fixing
> each error in turn.

You are assuming disable_on_error=true. It is false by default. But ok, let's accept that assumption for the sake of argument. Now, will you have to manually go through the process 10 times? I'm not sure. The user might figure out their mistake after seeing the first error.

> So in this scenario the new option might be more
> of a hindrance than a help because it would be easier if the user just
> did "ALTER SUBSCRIPTION sub DISABLE" manually and fixed all the
> problems in one sitting before re-enabling.

Yeah, but since the new option is off by default, I don't see any sensible complaint.

>
> * etc
>
> //////////
>
> Finally, here is one last (crazy?) thought-bubble just for
> consideration. I might be wrong, but my gut feeling is that the Stats
> Collector is intended more for "tracking" and for "metrics" rather
> than for holding duplicates of logged error messages. At the same
> time, I felt that disabling an entire subscription due to a single
> rogue error might be overkill sometimes.

I'm happy to entertain criticism of the particulars of how my patch approaches this problem, but it is already making a distinction between transient errors (resources, network, etc.) vs. ones that are non-transient. Again, I might not have drawn the line in the right place, but the patch is not intended to disable subscriptions in response to transient errors.

> But I wonder if there is a
> way to combine those two ideas so that the Stats Collector gets some
> new counter for tracking the number of worker re-launches that have
> occurred, meanwhile there could be a subscription option which gives a
> threshold above which you would disable the subscription.
> e.g.
> "disable_on_error_threshold=0" default, relaunch forever
> "disable_on_error_threshold=1" disable upon first error encountered.
> (This is how your patch behaves now I think.)
> "disable_on_error_threshold=500" disable if the re-launch errors go
> unattended and happen 500 times.

That sounds like a misfeature to me. You could have a subscription that works fine for a month, surviving numerous short network outages, but then gets autodisabled after a longer network outage. I'm not sure why anybody would want that. You might argue for exponential backoff, where it never gets autodisabled on transient errors, but retries less frequently. But I don't want to expand the scope of this patch to include that, at least not without a lot more evidence that it is needed.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2021-06-22 02:57:59 Re: Maintaining a list of pgindent commits for "git blame" to ignore
Previous Message Thomas Munro 2021-06-22 02:48:51 Re: Use simplehash.h instead of dynahash in SMgr