RE: Optionally automatically disable logical replication subscriptions on error

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: 'Peter Smith' <smithpb2250(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, "Smith, Peter" <peters(at)fast(dot)au(dot)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Optionally automatically disable logical replication subscriptions on error
Date: 2022-02-22 04:11:32
Message-ID: TYCPR01MB8373C507735DA618084258F0ED3B9@TYCPR01MB8373.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tuesday, February 22, 2022 7:53 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> On Mon, Feb 21, 2022 at 11:44 PM osumi(dot)takamichi(at)fujitsu(dot)com
> <osumi(dot)takamichi(at)fujitsu(dot)com> wrote:
> >
> > On Monday, February 21, 2022 2:56 PM Peter Smith
> <smithpb2250(at)gmail(dot)com> wrote:
> > > Thanks for addressing my previous comments. Now I have looked at v19.
> > >
> > > On Mon, Feb 21, 2022 at 11:25 AM osumi(dot)takamichi(at)fujitsu(dot)com
> > > <osumi(dot)takamichi(at)fujitsu(dot)com> wrote:
> > > >
> > > > On Friday, February 18, 2022 3:27 PM Peter Smith
> > > <smithpb2250(at)gmail(dot)com> wrote:
> > > > > Hi. Below are my code review comments for v18.
> > > > Thank you for your review !
> > > ...
> > > > > 5. src/backend/replication/logical/worker.c -
> > > > > DisableSubscriptionOnError
> > > > >
> > > > > + /*
> > > > > + * We would not be here unless this subscription's disableonerr
> > > > > + field was
> > > > > + * true when our worker began applying changes, but check
> > > > > + whether that
> > > > > + * field has changed in the interim.
> > > > > + */
> > > > >
> > > > > Apparently, this function might just do nothing if it detects
> > > > > some situation where the flag was changed somehow, but I'm not
> > > > > 100% sure that the callers are properly catering for when nothing
> happens.
> > > > >
> > > > > IMO it would be better if this function would return true/false
> > > > > to mean "did disable subscription happen or not?" because that
> > > > > will give the calling code the chance to check the function
> > > > > return and do the right thing - e.g. if the caller first thought
> > > > > it should be disabled but then
> > > it turned out it did NOT disable...
> > > > I don't think we need to do something more.
> > > > After this function, table sync worker and the apply worker just exit.
> > > > IMO, we don't need to do additional work for already-disabled
> > > > subscription on the caller sides.
> > > > It should be sufficient to fulfill the purpose of
> > > > DisableSubscriptionOnError or confirm it has been fulfilled.
> > >
> > > Hmmm - Yeah, it may be the workers might just exit soon after
> > > anyhow as you say so everything comes out in the wash, but still, I
> > > felt for this case when DisableSubscriptionOnError turned out to do
> > > nothing it would be better to exit via the existing logic. And that is easy to do
> if the function returns true/false.
> > >
> > > For example, changes like below seemed neater code to me. YMMV.
> > >
> > > BEFORE (SyncTableStartWrapper):
> > > if (MySubscription->disableonerr)
> > > {
> > > DisableSubscriptionOnError();
> > > proc_exit(0);
> > > }
> > > AFTER
> > > if (MySubscription->disableonerr && DisableSubscriptionOnError())
> > > proc_exit(0);
> > >
> > > BEFORE (ApplyLoopWrapper)
> > > if (MySubscription->disableonerr)
> > > {
> > > /* Disable the subscription */
> > > DisableSubscriptionOnError();
> > > return;
> > > }
> > > AFTER
> > > if (MySubscription->disableonerr && DisableSubscriptionOnError())
> > > return;
> > Okay, so this return value works for better readability.
> > Fixed.
> >
> >
> > > ~~~
> > >
> > > Here are a couple more comments:
> > >
> > > 1. src/backend/replication/logical/worker.c -
> > > DisableSubscriptionOnError, Refactor error handling
> > >
> > > (this comment assumes the above gets changed too)
> > I think those are independent.
>
> OK. I was only curious if the change #5 above might cause the error to be logged
> 2x, if the DisableSubscriptionOnError returns false.
> - firstly, when it logs errors within the function
> - secondly, by normal error mechanism when the caller re-throws it.
>
> But, if you are sure that won't happen then it is good news.
I didn't feel this would become a substantial issue.

When we alter subscription with disable_on_error = false
after we go into the DisableSubscriptionOnError,
we don't disable the subscription in the same function.
That means we launch new apply workers repeatedly after that
until we solve the error cause or we set the disable_on_error = true again.

So, if we confirm that the disable_on_error = false in the DisableSubscriptionOnError,
it's highly possible that we'll get more same original errors from new apply workers.

This leads to another question, we should suppress the 2nd error(if there is),
even when it's highly possible that we'll get more same errors by new apply workers
created repeatedly or not. I wasn't sure if the implementation complexity for this wins the log print.

So, kindly let me keep the current code as is.
If someone wants share his/her opinion on this, please let me know.

> >
> >
> > > +static void
> > > +DisableSubscriptionOnError(void)
> > > +{
> > > + Relation rel;
> > > + bool nulls[Natts_pg_subscription]; bool
> > > +replaces[Natts_pg_subscription]; Datum
> > > +values[Natts_pg_subscription]; HeapTuple tup;
> > > +Form_pg_subscription subform;
> > > +
> > > + /* Emit the error */
> > > + EmitErrorReport();
> > > + /* Abort any active transaction */ AbortOutOfAnyTransaction();
> > > + /* Reset the ErrorContext */
> > > + FlushErrorState();
> > > +
> > > + /* Disable the subscription in a fresh transaction */
> > > + StartTransactionCommand();
> > >
> > > If this DisableSubscriptionOnError function decides later that
> > > actually the 'disableonerr' flag is false (i.e. it's NOT going to
> > > disable the subscription after
> > > all) then IMO it make more sense that the error logging for that
> > > case should just do whatever it is doing now by the normal error processing
> mechanism.
> > >
> > > In other words, I thought perhaps the code to
> > > EmitErrorReport/FlushError state etc be moved to be BELOW the if
> > > (!subform->subdisableonerr) bail-out code?
> > >
> > > Please see what you think in my attached POC [1]. It seems neater to
> > > me, and tests are all OK. Maybe I am mistaken...
> > I had a concern that this order change of codes would have a negative
> > impact when we have another new error during the call of
> DisableSubscriptionOnError.
> >
> > With the debugger, I raised an error in this function before emitting the
> original error.
> > As a result, the original error that makes the apply worker go into
> > the path of DisableSubscriptionOnError (in my test, duplication error) has
> vanished.
> > In this sense, v19 looks safer, and the current order to handle error
> > recovery first looks better to me.
> >
> > FYI, after the 2nd debugger error,
> > the next new apply worker created quickly met the same type of error,
> > went into the same path, and disabled the subscription with the log.
> > But, it won't be advisable to let the possibility left.
>
> OK - thanks for checking it.
>
> Will it be better to put some comments about that? Something like --
>
> BEFORE
> /* Emit the error */
> EmitErrorReport();
> /* Abort any active transaction */
> AbortOutOfAnyTransaction();
> /* Reset the ErrorContext */
> FlushErrorState();
>
> /* Disable the subscription in a fresh transaction */
> StartTransactionCommand();
>
> AFTER
> /* Disable the subscription in a fresh transaction */
> AbortOutOfAnyTransaction(); StartTransactionCommand();
>
> /*
> * Log the error that caused DisableSubscriptionOnError to be called.
> * We do this immediately so that it won't be lost if some other internal
> * error occurs in the following code,
> */
> EmitErrorReport();
> FlushErrorState();
I appreciate your suggestion. Yet, I'd like to keep the current order of my patch.
The FlushErrorState's comment mentions we are not out of the error subsystem until
we call this and starting a new transaction before it didn't sound a good idea.
But, I've fixed the comments around this. The indentation for new comments
are checked by pgindent.

Best Regards,
Takamichi Osumi

Attachment Content-Type Size
v21-0001-Optionally-disable-subscriptions-on-error.patch application/octet-stream 51.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2022-02-22 04:19:01 Re: logical replication empty transactions
Previous Message Fujii Masao 2022-02-22 03:57:44 Re: [Proposal] Add foreign-server health checks infrastructure