Re: Optionally automatically disable logical replication subscriptions on error

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Optionally automatically disable logical replication subscriptions on error
Date: 2021-06-18 06:34:47
Message-ID: CAHut+PszB2hpYOMzLXmCjPkd1s49TfmYvFYQD4JmEaWUrzkNnw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jun 18, 2021 at 6:18 AM Mark Dilger
<mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>
> Hackers,
>
> Logical replication apply workers for a subscription can easily get stuck in an infinite loop of attempting to apply a change, triggering an error (such as a constraint violation), exiting with an error written to the subscription worker log, and restarting.
>
> As things currently stand, only superusers can create subscriptions. Ongoing work to delegate superuser tasks to non-superusers creates the potential for even more errors to be triggered, specifically, errors where the apply worker does not have permission to make changes to the target table.
>
> The attached patch makes it possible to create a subscription using a new subscription_parameter, "disable_on_error", such that rather than going into an infinite loop, the apply worker will catch errors and automatically disable the subscription, breaking the loop. The new parameter defaults to false. When false, the PG_TRY/PG_CATCH overhead is avoided, so for subscriptions which do not use the feature, there shouldn't be any change. Users can manually clear the error after fixing the underlying issue with an ALTER SUBSCRIPTION .. ENABLE command.
>
> In addition to helping on production systems, this makes writing TAP tests involving error conditions simpler. I originally ran into the motivation to write this patch when frustrated that TAP tests needed to parse the apply worker log file to determine whether permission failures were occurring and what they were. It was also obnoxiously easy to have a test get stuck waiting for a permanently stuck subscription to catch up. This helps with both issues.
>
> I don't think this is quite ready for commit, but I'd like feedback if folks like this idea or want to suggest design changes.

I tried your patch.

It applied OK (albeit with whitespace warnings).

The code build and TAP tests are all OK.

Below are a few comments and observations.

COMMENTS
========

(1) PG Docs catalogs.sgml

Documented new column "suberrmsg" but did not document the other new
columns ("disable_on_error", "disabled_by_error")?

------

(2) New column "disabled_by_error".

I wondered if there was actually any need for this column. Isn't the
same information conveyed by just having "subenabled" = false, at same
time as as non-empty "suberrmsg"? This would remove any confusion for
having 2 booleans which both indicate disabled.

------

(3) New columns "disabled_by_error", "disabled_on_error".

All other columns of the pg_subscription have a "sub" prefix.

------

(4) errhint member used?

@@ -91,12 +100,16 @@ typedef struct Subscription
char *name; /* Name of the subscription */
Oid owner; /* Oid of the subscription owner */
bool enabled; /* Indicates if the subscription is enabled */
+ bool disable_on_error; /* Whether errors automatically disable */
+ bool disabled_by_error; /* Whether an error has disabled */
bool binary; /* Indicates if the subscription wants data in
* binary format */
bool stream; /* Allow streaming in-progress transactions. */
char *conninfo; /* Connection string to the publisher */
char *slotname; /* Name of the replication slot */
char *synccommit; /* Synchronous commit setting for worker */
+ char *errmsg; /* Message from error which disabled */
+ char *errhint; /* Hint from error which disabled */
List *publications; /* List of publication names to subscribe to */
} Subscription;

I did not find any code using that newly added member "errhint".

------

(5) dump.c

i. No mention of new columns "disabled_on_error" and
"disabled_by_error". Is that right?

ii. Shouldn't the code for the "suberrmsg" be qualified with some PG
version number checks?

------

(6) Patch only handles errors only from the Apply worker.

Tablesync can give similar errors (e.g. PK violation during DATASYNC
phase) which will trigger re-launch forever regardless of the setting
of "disabled_on_error".
(confirmed by observations below)

------

(7) TAP test code

+$node_subscriber->init(allows_streaming => 'logical');

AFAIK that "logical" configuration is not necessary for the subscriber side. So,

$node_subscriber->init();

////////////

Some Experiments/Observations
==============================

In general, I found this functionality is useful and it works as
advertised by your patch comment.

======

Test: Display pg_subscription with the new columns
Observation: As expected. But some new colnames are not prefixed like
their peers.

test_sub=# \pset x
Expanded display is on.
test_sub=# select * from pg_subscription;
-[ RECORD 1 ]-----+--------------------------------------------------------
oid | 16394
subdbid | 16384
subname | tap_sub
subowner | 10
subenabled | t
disable_on_error | t
disabled_by_error | f
subbinary | f
substream | f
subconninfo | host=localhost dbname=test_pub application_name=tap_sub
subslotname | tap_sub
subsynccommit | off
suberrmsg |
subpublications | {tap_pub}

======

Test: Cause a PK violation during normal Apply replication (when
"disabled_on_error=true")
Observation: Apply worker stops. Subscription is disabled. Error
message is in the catalog.

2021-06-18 15:12:45.905 AEST [25904] LOG: edata is true for
subscription 'tap_sub': message = "duplicate key value violates unique
constraint "test_tab_pkey"", hint = "<NONE>"
2021-06-18 15:12:45.905 AEST [25904] LOG: logical replication apply
worker for subscription "tap_sub" will stop because the subscription
was disabled due to error
2021-06-18 15:12:45.905 AEST [25904] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:12:45.905 AEST [25904] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:12:45.908 AEST [19924] LOG: background worker "logical
replication worker" (PID 25904) exited with exit code 1

test_sub=# select * from pg_subscription;
-[ RECORD 1 ]-----+---------------------------------------------------------------
oid | 16394
subdbid | 16384
subname | tap_sub
subowner | 10
subenabled | f
disable_on_error | t
disabled_by_error | t
subbinary | f
substream | f
subconninfo | host=localhost dbname=test_pub application_name=tap_sub
subslotname | tap_sub
subsynccommit | off
suberrmsg | duplicate key value violates unique constraint
"test_tab_pkey"
subpublications | {tap_pub}

======

Test: Try to enable subscription (without fixing the PK violation problem).
Observation. OK. It just stops again

test_sub=# alter subscription tap_sub enable;
ALTER SUBSCRIPTION
test_sub=# 2021-06-18 15:17:18.067 AEST [10228] LOG: logical
replication apply worker for subscription "tap_sub" has started
2021-06-18 15:17:18.078 AEST [10228] LOG: edata is true for
subscription 'tap_sub': message = "duplicate key value violates unique
constraint "test_tab_pkey"", hint = "<NONE>"
2021-06-18 15:17:18.078 AEST [10228] LOG: logical replication apply
worker for subscription "tap_sub" will stop because the subscription
was disabled due to error
2021-06-18 15:17:18.078 AEST [10228] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:17:18.078 AEST [10228] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:17:18.079 AEST [19924] LOG: background worker "logical
replication worker" (PID 10228) exited with exit code 1

======

Test: Manually disable the subscription (which had previously already
been disabled due to error)
Observation: OK. The suberrmsg gets reset to an empty string.

alter subscription tap_sub disable;

=====

Test: Turn off the disable_on_error
Observation: As expected, now the Apply worker goes into re-launch
forever loop every time it hits PK violation

test_sub=# alter subscription tap_sub set (disable_on_error=false);
ALTER SUBSCRIPTION

...

======

Test: Cause a PK violation in the Tablesync copy (DATASYNC) phase.
(when disable_on_error = true)
Observation: This patch changes nothing for this case. The Tablesyn
re-launchs in a forever loop the same as current functionality.

test_sub=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost
dbname=test_pub application_name=tap_sub' PUBLICATION tap_pub WITH
(disable_on_error=false);
NOTICE: created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
test_sub=# 2021-06-18 15:38:19.547 AEST [18205] LOG: logical
replication apply worker for subscription "tap_sub" has started
2021-06-18 15:38:19.557 AEST [18207] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:19.610 AEST [18207] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:19.610 AEST [18207] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:38:19.610 AEST [18207] CONTEXT: COPY test_tab, line 1
2021-06-18 15:38:19.611 AEST [19924] LOG: background worker "logical
replication worker" (PID 18207) exited with exit code 1
2021-06-18 15:38:24.634 AEST [18369] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:24.689 AEST [18369] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:24.689 AEST [18369] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:38:24.689 AEST [18369] CONTEXT: COPY test_tab, line 1
2021-06-18 15:38:24.690 AEST [19924] LOG: background worker "logical
replication worker" (PID 18369) exited with exit code 1
2021-06-18 15:38:29.701 AEST [18521] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has started
2021-06-18 15:38:29.765 AEST [18521] ERROR: duplicate key value
violates unique constraint "test_tab_pkey"
2021-06-18 15:38:29.765 AEST [18521] DETAIL: Key (a)=(1) already exists.
2021-06-18 15:38:29.765 AEST [18521] CONTEXT: COPY test_tab, line 1
2021-06-18 15:38:29.766 AEST [19924] LOG: background worker "logical
replication worker" (PID 18521) exited with exit code 1
etc...

-[ RECORD 1 ]-----+--------------------------------------------------------
oid | 16399
subdbid | 16384
subname | tap_sub
subowner | 10
subenabled | t
disable_on_error | f
disabled_by_error | f
subbinary | f
substream | f
subconninfo | host=localhost dbname=test_pub application_name=tap_sub
subslotname | tap_sub
subsynccommit | off
suberrmsg |
subpublications | {tap_pub}

------
Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2021-06-18 06:43:22 Re: Toast compression method options
Previous Message Michael Paquier 2021-06-18 05:57:18 Re: Speed up pg_checksums in cases where checksum already set