RE: Optionally automatically disable logical replication subscriptions on error

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, "Smith, Peter" <peters(at)fast(dot)au(dot)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Optionally automatically disable logical replication subscriptions on error
Date: 2021-12-03 13:20:35
Message-ID: TYCPR01MB83732011D83A027C32A2802CED6A9@TYCPR01MB8373.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thursday, December 2, 2021 4:41 PM I wrote:
> On Thursday, December 2, 2021 1:49 PM Amit Kapila
> <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > On Thu, Dec 2, 2021 at 6:35 AM osumi(dot)takamichi(at)fujitsu(dot)com
> > <osumi(dot)takamichi(at)fujitsu(dot)com> wrote:
> > >
> > > On Wednesday, December 1, 2021 10:16 PM Amit Kapila
> > <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > Updated the patch to include the notification.
> > >
> > The patch disables the subscription for non-transient errors. I am not
> > sure if we can easily make the call to decide whether any particular
> > error is transient or not. For example, DISK_FULL or OUT_OF_MEMORY
> might not rectify itself.
> > Why not just allow to disable the subscription on any error? And then
> > let the user check the error either in view or logs and decide whether
> > it would like to enable the subscription or do something before it
> > (like making space in disk, or fixing the network).
> Agreed. I'll treat any errors as the trigger of the feature in the next version.
>
> > The other problem I see with this transient error stuff is maintaining
> > the list of error codes that we think are transient. I think we need a
> > discussion for each of the error_codes we are listing now and whatever
> > new error_code we add in the future which doesn't seem like a good idea.
> This is also true. The maintenance cost of my current implementation didn't
> sound cheap.
>
> > I think the code to deal with apply worker errors and then disable the
> > subscription has some flaws. Say, while disabling the subscription if
> > it leads to another error then I think the original error won't be
> > reported. Can't we simply emit the error via EmitErrorReport and then
> > do AbortOutOfAnyTransaction, FlushErrorState, and any other memory
> > context clean up if required and then disable the subscription after coming
> out of catch?
> You are right. I'll fix related parts accordingly.
Hi, I've made a new patch v11 that incorporated suggestions described above.

There are several notes to share regarding v11 modifications.

1. Modified the commit message a bit.

2. DisableSubscriptionOnError() doesn't return ErrData anymore,
since now to emit error message is done in the error recovery area
and the function purpose has become purely to run a transaction to disable
the subscription.

3. In DisableSubscriptionOnError(), v10 rethrew the error if the disable_on_error
flag became false in the interim, but v11 just closes the transaction and
finishes the function.

4. If table sync worker detects an error during synchronization
and needs to disable the subscription, the worker disables it and just exit by proc_exit.
The processing after disabling the subscription didn't look necessary to me
for disabled subscription.

5. Only when we succeed in the table synchronization, it's necessary to
allocate slot name in long-lived context, after the table synchronization in
ApplyWorkerMain(). Otherwise, we'll see junk value of syncslotname
because it is the return value of LogicalRepSyncTableStart().

6. There are 3 places for error recovery in ApplyWorkerMain().
All of those might look similar but I didn't make an united function for them.
Those are slightly different from each other and I felt
readability is reduced by grouping them into one type of function call.

7. In v11, I covered the case that apply worker failed with
apply_error_callback_arg.command == 0, as one path to disable the subscription
in order to cover all errors.

8. I changed one flag name from 'disable_subscription' to 'did_error'
in ApplyWorkerMain().

9. All chages in this version are C codes and checked by pgindent.

Best Regards,
Takamichi Osumi

Attachment Content-Type Size
v11-0001-Optionally-disable-subscriptions-on-error.patch application/octet-stream 49.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2021-12-03 13:27:43 Re: pg_get_publication_tables() output duplicate relid
Previous Message Peter Eisentraut 2021-12-03 13:12:35 Re: Some RELKIND macro refactoring