Re: Optionally automatically disable logical replication subscriptions on error

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, "Smith, Peter" <peters(at)fast(dot)au(dot)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Optionally automatically disable logical replication subscriptions on error
Date: 2022-02-21 05:55:43
Message-ID: CAHut+Pt95poJxTCeCaP-dmVBuTk405UkDCa3jts7uLBmxzY=vw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for addressing my previous comments. Now I have looked at v19.

On Mon, Feb 21, 2022 at 11:25 AM osumi(dot)takamichi(at)fujitsu(dot)com
<osumi(dot)takamichi(at)fujitsu(dot)com> wrote:
>
> On Friday, February 18, 2022 3:27 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> > Hi. Below are my code review comments for v18.
> Thank you for your review !
...
> > 5. src/backend/replication/logical/worker.c - DisableSubscriptionOnError
> >
> > + /*
> > + * We would not be here unless this subscription's disableonerr field
> > + was
> > + * true when our worker began applying changes, but check whether that
> > + * field has changed in the interim.
> > + */
> >
> > Apparently, this function might just do nothing if it detects some situation
> > where the flag was changed somehow, but I'm not 100% sure that the callers
> > are properly catering for when nothing happens.
> >
> > IMO it would be better if this function would return true/false to mean "did
> > disable subscription happen or not?" because that will give the calling code the
> > chance to check the function return and do the right thing - e.g. if the caller first
> > thought it should be disabled but then it turned out it did NOT disable...
> I don't think we need to do something more.
> After this function, table sync worker and the apply worker
> just exit. IMO, we don't need to do additional work for
> already-disabled subscription on the caller sides.
> It should be sufficient to fulfill the purpose of
> DisableSubscriptionOnError or confirm it has been fulfilled.

Hmmm - Yeah, it may be the workers might just exit soon after anyhow
as you say so everything comes out in the wash, but still, I felt for
this case when DisableSubscriptionOnError turned out to do nothing it
would be better to exit via the existing logic. And that is easy to do
if the function returns true/false.

For example, changes like below seemed neater code to me. YMMV.

BEFORE (SyncTableStartWrapper):
if (MySubscription->disableonerr)
{
DisableSubscriptionOnError();
proc_exit(0);
}
AFTER
if (MySubscription->disableonerr && DisableSubscriptionOnError())
proc_exit(0);

BEFORE (ApplyLoopWrapper)
if (MySubscription->disableonerr)
{
/* Disable the subscription */
DisableSubscriptionOnError();
return;
}
AFTER
if (MySubscription->disableonerr && DisableSubscriptionOnError())
return;

~~~

Here are a couple more comments:

1. src/backend/replication/logical/worker.c -
DisableSubscriptionOnError, Refactor error handling

(this comment assumes the above gets changed too)

+static void
+DisableSubscriptionOnError(void)
+{
+ Relation rel;
+ bool nulls[Natts_pg_subscription];
+ bool replaces[Natts_pg_subscription];
+ Datum values[Natts_pg_subscription];
+ HeapTuple tup;
+ Form_pg_subscription subform;
+
+ /* Emit the error */
+ EmitErrorReport();
+ /* Abort any active transaction */
+ AbortOutOfAnyTransaction();
+ /* Reset the ErrorContext */
+ FlushErrorState();
+
+ /* Disable the subscription in a fresh transaction */
+ StartTransactionCommand();

If this DisableSubscriptionOnError function decides later that
actually the 'disableonerr' flag is false (i.e. it's NOT going to
disable the subscription after all) then IMO it make more sense that
the error logging for that case should just do whatever it is doing
now by the normal error processing mechanism.

In other words, I thought perhaps the code to
EmitErrorReport/FlushError state etc be moved to be BELOW the if
(!subform->subdisableonerr) bail-out code?

Please see what you think in my attached POC [1]. It seems neater to
me, and tests are all OK. Maybe I am mistaken...

~~~

2. Commit message - wording

Logical replication apply workers for a subscription can easily get
stuck in an infinite loop of attempting to apply a change,
triggering an error (such as a constraint violation), exiting with
an error written to the subscription worker log, and restarting.

SUGGESTION
"exiting with an error written" --> "exiting with the error written"

------
[1] peter-v19-poc.diff - POC just to try some of my suggestions above
to make sure all tests still pass ok.

Kind Regards,
Peter Smith.
Fujitsu Australia.

Attachment Content-Type Size
peter-v19-poc.diff.txt text/plain 2.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Lakhin 2022-02-21 07:00:00 Re: Assert in pageinspect with NULL pages
Previous Message Etsuro Fujita 2022-02-21 05:55:07 Re: postgres_fdw: commit remote (sub)transactions in parallel during pre-commit