Re: Optionally automatically disable logical replication subscriptions on error

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, "Smith, Peter" <peters(at)fast(dot)au(dot)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Optionally automatically disable logical replication subscriptions on error
Date: 2022-03-06 23:25:21
Message-ID: CAHut+PtvnHOzDqnXxW72JWz4Z-0ecYrwat2MyOmB1NZZD_GP1g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 4, 2022 at 5:55 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Wed, Mar 2, 2022 at 6:38 PM osumi(dot)takamichi(at)fujitsu(dot)com
> <osumi(dot)takamichi(at)fujitsu(dot)com> wrote:
> >
> > On Wednesday, March 2, 2022 12:47 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > After more thoughts, should we do both AbortOutOfAnyTransaction() and error
> > > message handling while holding interrupts? That is,
> > >
> > > HOLD_INTERRUPTS();
> > > EmitErrorReport();
> > > FlushErrorState();
> > > AbortOutOfAny Transaction();
> > > RESUME_INTERRUPTS();
> > > pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_work
> > > er());
> > >
> > > I think it's better that we do clean up first and then do other works such as
> > > sending the message to the stats collector and updating the catalog.
> > I agree. Fixed. Along with this change, I corrected the header comment of
> > DisableSubscriptionOnError, too.
> >
> >
> > > Here are some comments on v24 patch:
> > >
> > > + /* Look up our subscription in the catalogs */
> > > + tup = SearchSysCacheCopy2(SUBSCRIPTIONNAME, MyDatabaseId,
> > > +
> > > CStringGetDatum(MySubscription->name));
> > >
> > > s/catalogs/catalog/
> > >
> > > Why don't we use SUBSCRIPTIONOID with MySubscription->oid?
> > Changed.
> >
> >
> > > ---
> > > + if (!HeapTupleIsValid(tup))
> > > + ereport(ERROR,
> > > + errcode(ERRCODE_UNDEFINED_OBJECT),
> > > + errmsg("subscription \"%s\" does not
> > > exist",
> > > + MySubscription->name));
> > >
> > > I think we should use elog() here rather than ereport() since it's a
> > > should-not-happen error.
> > Fixed
> >
> >
> > > ---
> > > + /* Notify the subscription will be no longer valid */
> > >
> > > I'd suggest rephrasing it to like "Notify the subscription will be disabled". (the
> > > subscription is still valid actually, but just disabled).
> > Fixed. Also, I've made this sentence past one, because of the code place
> > change below.
> >
> >
> > > ---
> > > + /* Notify the subscription will be no longer valid */
> > > + ereport(LOG,
> > > + errmsg("logical replication subscription
> > > \"%s\" will be disabled due to an error",
> > > + MySubscription->name));
> > > +
> > >
> > > I think we can report the log at the end of this function rather than during the
> > > transaction.
> > Fixed. In this case, I needed to adjust the comment to indicate the processing
> > to disable the sub has *completed* as well.
> >
> > > ---
> > > +my $cmd = qq(
> > > +CREATE TABLE tbl (i INT);
> > > +ALTER TABLE tbl REPLICA IDENTITY FULL;
> > > +CREATE INDEX tbl_idx ON tbl(i));
> > >
> > > I think we don't need to set REPLICA IDENTITY FULL to this table since there is
> > > notupdate/delete.
> > >
> > > Do we need tbl_idx?
> > Removed both the replica identity and tbl_idx;
> >
> >
> > > ---
> > > +$cmd = qq(
> > > +SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr WHERE
> > > +sr.srsubstate IN ('s', 'r'));
> > > +$node_subscriber->poll_query_until('postgres', $cmd);
> > >
> > > It seems better to add a condition of srrelid just in case.
> > Makes sense. Fixed.
> >
> >
> > > ---
> > > +$cmd = qq(
> > > +SELECT count(1) = 1 FROM pg_catalog.pg_subscription s WHERE
> > > s.subname =
> > > +'sub' AND s.subenabled IS FALSE);
> > > +$node_subscriber->poll_query_until('postgres', $cmd)
> > > + or die "Timed out while waiting for subscriber to be disabled";
> > >
> > > I think that it's more natural to directly check the subscription's subenabled.
> > > For example:
> > >
> > > SELECT subenabled = false FROM pg_subscription WHERE subname = 'sub';
> > Fixed. I modified another code similar to this for tablesync error as well.
> >
> >
> > > ---
> > > +$cmd = q(ALTER SUBSCRIPTION sub ENABLE);
> > > +$node_subscriber->safe_psql('postgres', $cmd); $cmd = q(SELECT
> > > COUNT(1)
> > > += 3 FROM tbl WHERE i = 3);
> > > +$node_subscriber->poll_query_until('postgres', $cmd)
> > > + or die "Timed out while waiting for applying";
> > >
> > > I think it's better to wait for the subscriber to catch up and check the query
> > > result instead of using poll_query_until() so that we can check the query result
> > > in case where the test fails.
> > I modified the code to wait for the subscriber and deleted poll_query_until.
> > Also, when I consider the test failure for this test as you mentioned,
> > expecting and checking the number of return value that equals 3
> > would be better. So, I expressed this point in my test as well,
> > according to your advice.
> >
> >
> > > ---
> > > +$cmd = qq(DROP INDEX tbl_unique);
> > > +$node_subscriber->safe_psql('postgres', $cmd);
> > >
> > > In the newly added tap tests, all queries are consistently assigned to $cmd and
> > > executed even when the query is used only once. It seems a different style from
> > > the one in other tap tests. Is there any reason why we do this style for all queries
> > > in the tap tests?
> > Fixed. I removed the 'cmd' variable itself.
> >
> >
> > Attached an updated patch v26.
>
> Thank you for updating the patch.
>
> Here are some comments on v26 patch:
>
> +/*
> + * Disable the current subscription.
> + */
> +static void
> +DisableSubscriptionOnError(void)
>
> This function now just updates the pg_subscription catalog so can we
> move it to pg_subscritpion.c while having this function accept the
> subscription OID to disable? If you agree, the function comment will
> also need to be updated.
>
> ---
> + /*
> + * First, ensure that we log the error message so
> that it won't be
> + * lost if some other internal error occurs in the
> following code.
> + * Then, abort the current transaction and send the
> stats message of
> + * the table synchronization failure in an idle state.
> + */
> + HOLD_INTERRUPTS();
> + EmitErrorReport();
> + FlushErrorState();
> + AbortOutOfAnyTransaction();
> + RESUME_INTERRUPTS();
> + pgstat_report_subscription_error(MySubscription->oid, false);
> +
> + if (MySubscription->disableonerr)
> + {
> + DisableSubscriptionOnError();
> + proc_exit(0);
> + }
> +
> + PG_RE_THROW();
>
> If the disableonerr is false, the same error is reported twice. Also,
> the code in PG_CATCH() in both start_apply() and start_table_sync()
> are almost the same. Can we create a common function to do post-error
> processing?
>
> The worker should exit with return code 1.
>
> I've attached a fixup patch for changes to worker.c for your
> reference. Feel free to adopt the changes.

The way that common function is implemented required removal of the
existing PG_RE_THROW logic, which in turn was only possible using
special knowledge that this just happens to be the last try/catch
block for the apply worker. Yes, I believe everything will work ok,
but it just seemed like a step too far for me to change the throw
logic. I feel that once you get to the point of having to write
special comments in the code to explain "why we can get away with
doing this..." then that is an indication that perhaps it's not really
the best way...

Is there some alternative way to share common code, but without having
to change the existing throw error logic to do so?

OTOH, maybe others think it is ok?

------
Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-03-06 23:27:43 Re: Granting SET and ALTER SYSTE privileges for GUCs
Previous Message Victor Yegorov 2022-03-06 23:12:07 Re: ltree_gist indexes broken after pg_upgrade from 12 to 13