Re: Optionally automatically disable logical replication subscriptions on error

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Optionally automatically disable logical replication subscriptions on error
Date: 2021-06-19 00:03:45
Message-ID: 38706E50-9190-4623-9FD7-4E9E3DD3F81A@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Jun 17, 2021, at 11:34 PM, Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> I tried your patch.

Thanks for the quick and thorough review!

> (2) New column "disabled_by_error".
>
> I wondered if there was actually any need for this column. Isn't the
> same information conveyed by just having "subenabled" = false, at same
> time as as non-empty "suberrmsg"? This would remove any confusion for
> having 2 booleans which both indicate disabled.

Yeah, I wondered about that before posting v1. I removed the disabled_by_error field for v2.

> (3) New columns "disabled_by_error", "disabled_on_error".
>
> All other columns of the pg_subscription have a "sub" prefix.

I don't feel strongly about this. How about "subdisableonerr"? I used that in v2.

> I did not find any code using that newly added member "errhint".

Thanks for catching that. I had tried to remove all references to "errhint" before posting v1. The original idea was that both the message and hint of the error would be kept, but in testing I found the hint field was typically empty, so I removed it. Sorry that I left one mention of it lying around.

> (5) dump.c

I didn't bother getting pg_dump working before posting v1, and I still have not done so, as I mainly want to solicit feedback on whether the basic direction I am going will work for the community.

> (6) Patch only handles errors only from the Apply worker.
>
> Tablesync can give similar errors (e.g. PK violation during DATASYNC
> phase) which will trigger re-launch forever regardless of the setting
> of "disabled_on_error".
> (confirmed by observations below)

Yes, this is a good point, and also mentioned by Amit. I have fixed it in v2 and adjusted the regression test to trigger an automatic disabling for initial table sync as well as for change replication.

> 2021-06-18 15:12:45.905 AEST [25904] LOG: edata is true for
> subscription 'tap_sub': message = "duplicate key value violates unique
> constraint "test_tab_pkey"", hint = "<NONE>"

You didn't call this out, but FYI, I don't intend to leave this particular log message in the patch. It was for development only. I have removed it for v2 and have added a different log message much sooner after catching the error, to avoid squashing the error in case some other action fails.

The regression test shows this, if you open tmp_check/log/022_disable_on_error_subscriber.log:

2021-06-18 16:25:20.138 PDT [56926] LOG: logical replication subscription "s1" will be disabled due to error: duplicate key value violates unique constraint "s1_tbl_unique"
2021-06-18 16:25:20.139 PDT [56926] ERROR: duplicate key value violates unique constraint "s1_tbl_unique"
2021-06-18 16:25:20.139 PDT [56926] DETAIL: Key (i)=(1) already exists.
2021-06-18 16:25:20.139 PDT [56926] CONTEXT: COPY tbl, line 2

The first line logs the error prior to attempting to disable the subscription, and the next three lines are due to rethrowing the error after committing the successful disabling of the subscription. If the attempt to disable the subscription itself throws, these additional three lines won't show up, but the first one should. Amit mentioned this upthread. Do you think this will be ok, or would you like to also have a suberrdetail field so that the detail doesn't get lost? I haven't added such an extra field, and am inclined to think it would be excessive, but maybe others feel differently?

> ======
>
> Test: Cause a PK violation in the Tablesync copy (DATASYNC) phase.
> (when disable_on_error = true)
> Observation: This patch changes nothing for this case. The Tablesyn
> re-launchs in a forever loop the same as current functionality.

In v2, tablesync copy errors should also be caught. The test has been extended to cover this also.

Attachment Content-Type Size
v2-0001-Optionally-disabling-subscriptions-on-error.patch application/octet-stream 31.3 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2021-06-19 00:49:59 Re: pgbench logging broken by time logic changes
Previous Message Greg Sabino Mullane 2021-06-19 00:01:17 Re: Speed up pg_checksums in cases where checksum already set