Re: Optionally automatically disable logical replication subscriptions on error

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, "Smith, Peter" <peters(at)fast(dot)au(dot)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Optionally automatically disable logical replication subscriptions on error
Date: 2022-03-04 06:55:00
Message-ID: CAD21AoBMxAcbjwrE5Xa9AziXLVMODL4taAGhdEUw+-5AdzDTuA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 2, 2022 at 6:38 PM osumi(dot)takamichi(at)fujitsu(dot)com
<osumi(dot)takamichi(at)fujitsu(dot)com> wrote:
>
> On Wednesday, March 2, 2022 12:47 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > After more thoughts, should we do both AbortOutOfAnyTransaction() and error
> > message handling while holding interrupts? That is,
> >
> > HOLD_INTERRUPTS();
> > EmitErrorReport();
> > FlushErrorState();
> > AbortOutOfAny Transaction();
> > RESUME_INTERRUPTS();
> > pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_work
> > er());
> >
> > I think it's better that we do clean up first and then do other works such as
> > sending the message to the stats collector and updating the catalog.
> I agree. Fixed. Along with this change, I corrected the header comment of
> DisableSubscriptionOnError, too.
>
>
> > Here are some comments on v24 patch:
> >
> > + /* Look up our subscription in the catalogs */
> > + tup = SearchSysCacheCopy2(SUBSCRIPTIONNAME, MyDatabaseId,
> > +
> > CStringGetDatum(MySubscription->name));
> >
> > s/catalogs/catalog/
> >
> > Why don't we use SUBSCRIPTIONOID with MySubscription->oid?
> Changed.
>
>
> > ---
> > + if (!HeapTupleIsValid(tup))
> > + ereport(ERROR,
> > + errcode(ERRCODE_UNDEFINED_OBJECT),
> > + errmsg("subscription \"%s\" does not
> > exist",
> > + MySubscription->name));
> >
> > I think we should use elog() here rather than ereport() since it's a
> > should-not-happen error.
> Fixed
>
>
> > ---
> > + /* Notify the subscription will be no longer valid */
> >
> > I'd suggest rephrasing it to like "Notify the subscription will be disabled". (the
> > subscription is still valid actually, but just disabled).
> Fixed. Also, I've made this sentence past one, because of the code place
> change below.
>
>
> > ---
> > + /* Notify the subscription will be no longer valid */
> > + ereport(LOG,
> > + errmsg("logical replication subscription
> > \"%s\" will be disabled due to an error",
> > + MySubscription->name));
> > +
> >
> > I think we can report the log at the end of this function rather than during the
> > transaction.
> Fixed. In this case, I needed to adjust the comment to indicate the processing
> to disable the sub has *completed* as well.
>
> > ---
> > +my $cmd = qq(
> > +CREATE TABLE tbl (i INT);
> > +ALTER TABLE tbl REPLICA IDENTITY FULL;
> > +CREATE INDEX tbl_idx ON tbl(i));
> >
> > I think we don't need to set REPLICA IDENTITY FULL to this table since there is
> > notupdate/delete.
> >
> > Do we need tbl_idx?
> Removed both the replica identity and tbl_idx;
>
>
> > ---
> > +$cmd = qq(
> > +SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr WHERE
> > +sr.srsubstate IN ('s', 'r'));
> > +$node_subscriber->poll_query_until('postgres', $cmd);
> >
> > It seems better to add a condition of srrelid just in case.
> Makes sense. Fixed.
>
>
> > ---
> > +$cmd = qq(
> > +SELECT count(1) = 1 FROM pg_catalog.pg_subscription s WHERE
> > s.subname =
> > +'sub' AND s.subenabled IS FALSE);
> > +$node_subscriber->poll_query_until('postgres', $cmd)
> > + or die "Timed out while waiting for subscriber to be disabled";
> >
> > I think that it's more natural to directly check the subscription's subenabled.
> > For example:
> >
> > SELECT subenabled = false FROM pg_subscription WHERE subname = 'sub';
> Fixed. I modified another code similar to this for tablesync error as well.
>
>
> > ---
> > +$cmd = q(ALTER SUBSCRIPTION sub ENABLE);
> > +$node_subscriber->safe_psql('postgres', $cmd); $cmd = q(SELECT
> > COUNT(1)
> > += 3 FROM tbl WHERE i = 3);
> > +$node_subscriber->poll_query_until('postgres', $cmd)
> > + or die "Timed out while waiting for applying";
> >
> > I think it's better to wait for the subscriber to catch up and check the query
> > result instead of using poll_query_until() so that we can check the query result
> > in case where the test fails.
> I modified the code to wait for the subscriber and deleted poll_query_until.
> Also, when I consider the test failure for this test as you mentioned,
> expecting and checking the number of return value that equals 3
> would be better. So, I expressed this point in my test as well,
> according to your advice.
>
>
> > ---
> > +$cmd = qq(DROP INDEX tbl_unique);
> > +$node_subscriber->safe_psql('postgres', $cmd);
> >
> > In the newly added tap tests, all queries are consistently assigned to $cmd and
> > executed even when the query is used only once. It seems a different style from
> > the one in other tap tests. Is there any reason why we do this style for all queries
> > in the tap tests?
> Fixed. I removed the 'cmd' variable itself.
>
>
> Attached an updated patch v26.

Thank you for updating the patch.

Here are some comments on v26 patch:

+/*
+ * Disable the current subscription.
+ */
+static void
+DisableSubscriptionOnError(void)

This function now just updates the pg_subscription catalog so can we
move it to pg_subscritpion.c while having this function accept the
subscription OID to disable? If you agree, the function comment will
also need to be updated.

---
+ /*
+ * First, ensure that we log the error message so
that it won't be
+ * lost if some other internal error occurs in the
following code.
+ * Then, abort the current transaction and send the
stats message of
+ * the table synchronization failure in an idle state.
+ */
+ HOLD_INTERRUPTS();
+ EmitErrorReport();
+ FlushErrorState();
+ AbortOutOfAnyTransaction();
+ RESUME_INTERRUPTS();
+ pgstat_report_subscription_error(MySubscription->oid, false);
+
+ if (MySubscription->disableonerr)
+ {
+ DisableSubscriptionOnError();
+ proc_exit(0);
+ }
+
+ PG_RE_THROW();

If the disableonerr is false, the same error is reported twice. Also,
the code in PG_CATCH() in both start_apply() and start_table_sync()
are almost the same. Can we create a common function to do post-error
processing?

The worker should exit with return code 1.

I've attached a fixup patch for changes to worker.c for your
reference. Feel free to adopt the changes.

---
+
+# Confirm that we have finished the table sync.
+is( $node_subscriber->safe_psql(
+ 'postgres', qq(SELECT MAX(i), COUNT(*) FROM tbl)),
+ "1|3",
+ "subscription sub replicated data");
+

Can we store the result to a local variable to check? I think it's
more consistent with other tap tests.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachment Content-Type Size
0001-fixup-Optionally-disable-subscriptions-on-error.patch application/octet-stream 8.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2022-03-04 07:19:32 Re: wal_compression=zstd
Previous Message Michael Paquier 2022-03-04 06:44:22 pg_tablespace_location() failure with allow_in_place_tablespaces