RE: Optionally automatically disable logical replication subscriptions on error

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: 'Peter Smith' <smithpb2250(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, "Smith, Peter" <peters(at)fast(dot)au(dot)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Optionally automatically disable logical replication subscriptions on error
Date: 2022-02-21 12:44:04
Message-ID: TYCPR01MB83732F134B52B5A189EEABA8ED3A9@TYCPR01MB8373.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Monday, February 21, 2022 2:56 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> Thanks for addressing my previous comments. Now I have looked at v19.
>
> On Mon, Feb 21, 2022 at 11:25 AM osumi(dot)takamichi(at)fujitsu(dot)com
> <osumi(dot)takamichi(at)fujitsu(dot)com> wrote:
> >
> > On Friday, February 18, 2022 3:27 PM Peter Smith
> <smithpb2250(at)gmail(dot)com> wrote:
> > > Hi. Below are my code review comments for v18.
> > Thank you for your review !
> ...
> > > 5. src/backend/replication/logical/worker.c -
> > > DisableSubscriptionOnError
> > >
> > > + /*
> > > + * We would not be here unless this subscription's disableonerr
> > > + field was
> > > + * true when our worker began applying changes, but check whether
> > > + that
> > > + * field has changed in the interim.
> > > + */
> > >
> > > Apparently, this function might just do nothing if it detects some
> > > situation where the flag was changed somehow, but I'm not 100% sure
> > > that the callers are properly catering for when nothing happens.
> > >
> > > IMO it would be better if this function would return true/false to
> > > mean "did disable subscription happen or not?" because that will
> > > give the calling code the chance to check the function return and do
> > > the right thing - e.g. if the caller first thought it should be disabled but then
> it turned out it did NOT disable...
> > I don't think we need to do something more.
> > After this function, table sync worker and the apply worker just exit.
> > IMO, we don't need to do additional work for already-disabled
> > subscription on the caller sides.
> > It should be sufficient to fulfill the purpose of
> > DisableSubscriptionOnError or confirm it has been fulfilled.
>
> Hmmm - Yeah, it may be the workers might just exit soon after anyhow as you
> say so everything comes out in the wash, but still, I felt for this case when
> DisableSubscriptionOnError turned out to do nothing it would be better to exit
> via the existing logic. And that is easy to do if the function returns true/false.
>
> For example, changes like below seemed neater code to me. YMMV.
>
> BEFORE (SyncTableStartWrapper):
> if (MySubscription->disableonerr)
> {
> DisableSubscriptionOnError();
> proc_exit(0);
> }
> AFTER
> if (MySubscription->disableonerr && DisableSubscriptionOnError())
> proc_exit(0);
>
> BEFORE (ApplyLoopWrapper)
> if (MySubscription->disableonerr)
> {
> /* Disable the subscription */
> DisableSubscriptionOnError();
> return;
> }
> AFTER
> if (MySubscription->disableonerr && DisableSubscriptionOnError()) return;
Okay, so this return value works for better readability.
Fixed.


> ~~~
>
> Here are a couple more comments:
>
> 1. src/backend/replication/logical/worker.c - DisableSubscriptionOnError,
> Refactor error handling
>
> (this comment assumes the above gets changed too)
I think those are independent.

> +static void
> +DisableSubscriptionOnError(void)
> +{
> + Relation rel;
> + bool nulls[Natts_pg_subscription];
> + bool replaces[Natts_pg_subscription];
> + Datum values[Natts_pg_subscription];
> + HeapTuple tup;
> + Form_pg_subscription subform;
> +
> + /* Emit the error */
> + EmitErrorReport();
> + /* Abort any active transaction */
> + AbortOutOfAnyTransaction();
> + /* Reset the ErrorContext */
> + FlushErrorState();
> +
> + /* Disable the subscription in a fresh transaction */
> + StartTransactionCommand();
>
> If this DisableSubscriptionOnError function decides later that actually the
> 'disableonerr' flag is false (i.e. it's NOT going to disable the subscription after
> all) then IMO it make more sense that the error logging for that case should just
> do whatever it is doing now by the normal error processing mechanism.
>
> In other words, I thought perhaps the code to EmitErrorReport/FlushError state
> etc be moved to be BELOW the if
> (!subform->subdisableonerr) bail-out code?
>
> Please see what you think in my attached POC [1]. It seems neater to me, and
> tests are all OK. Maybe I am mistaken...
I had a concern that this order change of codes would have a negative
impact when we have another new error during the call of DisableSubscriptionOnError.

With the debugger, I raised an error in this function before emitting the original error.
As a result, the original error that makes the apply worker go into the path of
DisableSubscriptionOnError (in my test, duplication error) has vanished.
In this sense, v19 looks safer, and the current order to handle error recovery first
looks better to me.

FYI, after the 2nd debugger error,
the next new apply worker created quickly met the same type of error,
went into the same path, and disabled the subscription with the log.
But, it won't be advisable to let the possibility left.

> ~~~
>
> 2. Commit message - wording
>
> Logical replication apply workers for a subscription can easily get stuck in an
> infinite loop of attempting to apply a change, triggering an error (such as a
> constraint violation), exiting with an error written to the subscription worker log,
> and restarting.
>
> SUGGESTION
> "exiting with an error written" --> "exiting with the error written"
Fixed.


> ------
> [1] peter-v19-poc.diff - POC just to try some of my suggestions above to make
> sure all tests still pass ok.
Thanks ! I included you as co-author, because
you shared meaningful patches for me.

Best Regards,
Takamichi Osumi

Attachment Content-Type Size
v20-0001-Optionally-disable-subscriptions-on-error.patch application/octet-stream 51.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2022-02-21 12:49:09 Re: initdb / bootstrap design
Previous Message Euler Taveira 2022-02-21 12:09:12 speed up a logical replica setup