Re: Skipping logical replication transactions on subscriber side

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Skipping logical replication transactions on subscriber side
Date: 2021-05-27 08:15:41
Message-ID: CAD21AoDyGQNVDUyY2b3=hfPm359GUP=Yep7i=ojL42cP+QAmpQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, May 27, 2021 at 2:48 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, May 27, 2021 at 9:56 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Wed, May 26, 2021 at 3:43 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Tue, May 25, 2021 at 12:26 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > >
> > > > On Mon, May 24, 2021 at 7:51 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Mon, May 24, 2021 at 1:32 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > > >
> > > > > > I think you need to consider a few more things here:
> > > > > > (a) Say the error occurs after applying some part of the changes; then
> > > > > > just skipping the remaining part won't be sufficient. We probably need
> > > > > > to somehow roll back the applied changes (by rolling back the
> > > > > > transaction or in some other way).
> > > >
> > > > > After more thought, it might be better that setting and resetting
> > > > > the XID to skip requires disabling the subscription.
> > > >
> > >
> > > It might be better if it doesn't require disabling the subscription
> > > because it would be more steps for the user to disable/enable it. It
> > > is not clear to me what exactly you want to gain by disabling the
> > > subscription in this case.
> >
> > > The situation I’m considering is where the user specifies the XID while
> > > the worker is applying the changes of the transaction with that XID.
> > > In this case, I think we need to somehow roll back the changes applied
> > > so far. Perhaps we can either roll back the transaction and ignore the
> > > remaining changes, or restart and ignore the entire transaction from
> > > the beginning.
> >
>
> If we follow your suggestion of only allowing XIDs that have been
> known to have conflicts then probably we don't need to worry about
> rollbacks.
>
> > > > >
> > > > > > For (2), what I'm thinking is to add a new action to ALTER
> > > > > > SUBSCRIPTION command like ALTER SUBSCRIPTION test_sub SET SKIP
> > > > > > TRANSACTION 590. Also, we can have actions to reset it; ALTER
> > > > > > SUBSCRIPTION test_sub RESET SKIP TRANSACTION. Those commands add the
> > > > > > XID to a new column of pg_subscription or a new catalog, having the
> > > > > > worker reread its subscription information. Once the worker skipped
> > > > > > the specified transaction, it resets the transaction to skip on the
> > > > > > catalog.
> > > > > >
> > > > >
> > > > > What if we fail while updating the reset information in the catalog?
> > > > > Will it be the responsibility of the user to reset such a transaction
> > > > > or we will retry it after restart of worker? Now, say, we give such a
> > > > > responsibility to the user and the user forgets to reset it then there
> > > > > is a possibility that after wraparound we will again skip the
> > > > > transaction which is not intended. And, if we want to retry it after
> > > > > restart of worker, how will the worker remember the previous failure?
> > > >
> > > > > As described above, setting and resetting the XID to skip is implemented
> > > > > as a normal system catalog change, so it's crash-safe and persisted. I
> > > > > think that the worker can either remove the XID or mark it as done
> > > > > once it has skipped the specified transaction, so that it won't skip the
> > > > > same XID again after wraparound.
> > > >
> > >
> > > It all depends on when exactly you want to update the catalog
> > > information. Say after skipping commit of the XID, we do update the
> > > corresponding LSN to be communicated as already processed to the
> > > subscriber and then get the error while updating the catalog
> > > information then next time we might not know whether to update the
> > > catalog for skipped XID.
> > >
> > > > Also, it might be better if we reset
> > > > the XID also when a subscription field such as subconninfo is changed
> > > > because it could imply the worker will connect to another publisher
> > > > having a different XID space.
> > > >
> > > > We also need to handle the cases where the user specifies an old XID
> > > > or XID whose transaction is already prepared on the subscriber. I
> > > > think the worker can reset the XID with a warning when it finds out
> > > > that the XID seems no longer valid or it cannot skip the specified
> > > > XID. For example in the former case, it can do that when the first
> > > > received transaction’s XID is newer than the specified XID.
> > > >
> > >
> > > > But how can we guarantee that an older XID can't be received later? Is
> > > > there a guarantee that we receive the transactions on the subscriber in
> > > > XID order?
> >
> > Considering the above two comments, it might be better to provide a
> > way to skip the transaction that is already known to be conflicted
> > rather than allowing users to specify the arbitrary XID.
> >
>
> Okay, that makes sense but still not sure how will you identify if we
> need to reset XID in case of failure doing that in the previous
> attempt.

It's just an idea, but could we record the XID of the failed transaction
as well as its commit LSN? The sequence I'm thinking of is:

1. The worker records the XID and commit LSN of the failed transaction
in a catalog.
2. The user specifies how to resolve that conflicting transaction
(currently only 'skip' is supported), and that is written to the catalog.
3. The worker applies the resolution method according to the catalog. If
the worker hasn't started to apply those changes, it can skip the entire
transaction. If it has, it rolls back the transaction and ignores the
remaining changes.

The worker needs neither to reset the information about the last failed
transaction nor to mark the conflicted transaction as resolved. When
checking the catalog, the worker will simply ignore that information if
its commit LSN has already been passed.
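
To make the sequence above a bit more concrete, here is a rough sketch of
the decision the worker would make. It is written in Python rather than
backend C purely for illustration. The ConflictEntry record, its field
names, and decide_action() are all hypothetical (nothing like them exists
in the tree), and LSNs are assumed to compare as plain integers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConflictEntry:
    """Hypothetical catalog row describing the last failed transaction."""
    xid: int                          # XID of the failed remote transaction
    commit_lsn: int                   # commit LSN of that transaction
    resolution: Optional[str] = None  # set by the user; only 'skip' for now


def decide_action(entry: Optional[ConflictEntry],
                  xid: int,
                  last_commit_lsn: int,
                  changes_applied: bool) -> str:
    """Return what the worker should do with the transaction `xid`.

    last_commit_lsn is the commit LSN the worker has already processed;
    changes_applied says whether part of `xid` was already applied.
    """
    # No entry, or the user has not asked for a resolution yet.
    if entry is None or entry.resolution != "skip":
        return "apply"
    # The recorded commit LSN is already passed: the entry is stale and is
    # ignored, so nothing ever has to be reset or marked as resolved.
    if last_commit_lsn >= entry.commit_lsn:
        return "apply"
    # A different transaction than the one recorded as failed.
    if xid != entry.xid:
        return "apply"
    # Step 3: skip entirely, or roll back first if we already started.
    return "rollback_and_skip" if changes_applied else "skip"
```

Since a stale entry is filtered out by the LSN comparison alone, a failure
to update the catalog after skipping would not leave the XID armed for a
later wraparound in this sketch.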

> Also, I am thinking that instead of a stat view, do we need
> to consider having a system table (pg_replication_conflicts or
> something like that) for this because what if stats information is
> lost (say either due to a crash or due to UDP packet loss), can we rely
> on stats view for this?

Yeah, it seems better to use a catalog.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
