Re: Optionally automatically disable logical replication subscriptions on error

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, "Smith, Peter" <peters(at)fast(dot)au(dot)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Optionally automatically disable logical replication subscriptions on error
Date: 2022-03-02 03:46:43
Message-ID: CAD21AoDPCnVpMB-nR_5S-ZsV-+c1VH2uVxJ99ZyO+H_tnO7KTg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 2, 2022 at 9:34 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> Please see below my review comments for v24.
>
> ======
>
> 1. src/backend/replication/logical/worker.c - start_table_sync
>
> + /* Report the worker failed during table synchronization */
> + pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());
>
> (This review comment is just FYI in case you did not do this deliberately)
>
> FYI, you didn't really need to call am_tablesync_worker() here because
> it is already asserted for the sync phase that it MUST be a tablesync
> worker.
>
> OTOH, IMO it documents the purpose of the parm so if it was deliberate
> then that is OK too.
>
> ~~~
>
> 2. src/backend/replication/logical/worker.c - start_table_sync
>
> + PG_CATCH();
> + {
> + /*
> + * Abort the current transaction so that we send the stats message in
> + * an idle state.
> + */
> + AbortOutOfAnyTransaction();
> +
> + /* Report the worker failed during table synchronization */
> + pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());
> +
>
> [Maybe you will say that this review comment is unrelated to
> disable_on_err, but since this is a totally new/refactored function
> then I think maybe there is no problem to make this change at the same
> time. Anyway there is no function change, it is just rearranging some
> comments.]
>
> I felt the separation of those 2 statements and comments makes that
> code less clean than it could/should be. IMO they should be grouped
> together.
>
> SUGGESTED
> /*
> * Report the worker failed during table synchronization. Abort the
> * current transaction so that the stats message is sent in an idle
> * state.
> */
> AbortOutOfAnyTransaction();
> pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

After more thoughts, should we do both AbortOutOfAnyTransaction() and
error message handling while holding interrupts? That is,

HOLD_INTERRUPTS();
EmitErrorReport();
FlushErrorState();
AbortOutOfAny Transaction();
RESUME_INTERRUPTS();
pgstat_report_subscription_error(MySubscription->oid, !am_tablesync_worker());

I think it's better that we do clean up first and then do other works
such as sending the message to the stats collector and updating the
catalog.

Here are some comments on v24 patch:

+ /* Look up our subscription in the catalogs */
+ tup = SearchSysCacheCopy2(SUBSCRIPTIONNAME, MyDatabaseId,
+
CStringGetDatum(MySubscription->name));

s/catalogs/catalog/

Why don't we use SUBSCRIPTIONOID with MySubscription->oid?

---
+ if (!HeapTupleIsValid(tup))
+ ereport(ERROR,
+ errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("subscription \"%s\" does not exist",
+ MySubscription->name));

I think we should use elog() here rather than ereport() since it's a
should-not-happen error.

---
+ /* Notify the subscription will be no longer valid */

I'd suggest rephrasing it to like "Notify the subscription will be
disabled". (the subscription is still valid actually, but just
disabled).

---
+ /* Notify the subscription will be no longer valid */
+ ereport(LOG,
+ errmsg("logical replication subscription
\"%s\" will be disabled due to an error",
+ MySubscription->name));
+

I think we can report the log at the end of this function rather than
during the transaction.

---
+my $cmd = qq(
+CREATE TABLE tbl (i INT);
+ALTER TABLE tbl REPLICA IDENTITY FULL;
+CREATE INDEX tbl_idx ON tbl(i));

I think we don't need to set REPLICA IDENTITY FULL to this table since
there is notupdate/delete.

Do we need tbl_idx?

---
+$cmd = qq(
+SELECT COUNT(1) = 1 FROM pg_catalog.pg_subscription_rel sr
+WHERE sr.srsubstate IN ('s', 'r'));
+$node_subscriber->poll_query_until('postgres', $cmd);

It seems better to add a condition of srrelid just in case.

---
+$cmd = qq(
+SELECT count(1) = 1 FROM pg_catalog.pg_subscription s
+WHERE s.subname = 'sub' AND s.subenabled IS FALSE);
+$node_subscriber->poll_query_until('postgres', $cmd)
+ or die "Timed out while waiting for subscriber to be disabled";

I think that it's more natural to directly check the subscription's
subenabled. For example:

SELECT subenabled = false FROM pg_subscription WHERE subname = 'sub';

---
+$cmd = q(ALTER SUBSCRIPTION sub ENABLE);
+$node_subscriber->safe_psql('postgres', $cmd);
+$cmd = q(SELECT COUNT(1) = 3 FROM tbl WHERE i = 3);
+$node_subscriber->poll_query_until('postgres', $cmd)
+ or die "Timed out while waiting for applying";

I think it's better to wait for the subscriber to catch up and check
the query result instead of using poll_query_until() so that we can
check the query result in case where the test fails.

---
+$cmd = qq(DROP INDEX tbl_unique);
+$node_subscriber->safe_psql('postgres', $cmd);

In the newly added tap tests, all queries are consistently assigned to
$cmd and executed even when the query is used only once. It seems a
different style from the one in other tap tests. Is there any reason
why we do this style for all queries in the tap tests?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2022-03-02 04:02:45 Re: Add the replication origin name and commit-LSN to logical replication worker errcontext
Previous Message vignesh C 2022-03-02 03:26:57 Re: Handle infinite recursion in logical replication setup