Re: Skipping logical replication transactions on subscriber side

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, Alexey Lesovsky <lesovsky(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Skipping logical replication transactions on subscriber side
Date: 2022-01-24 04:48:48
Message-ID: CAKFQuwYqGOai6JKBYO4zr6xs30TfxQwiiksU55styFerWbG8Lg@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jan 23, 2022 at 8:35 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> > I really dislike the user experience this provides, and given it is new
> > in v15 (and right now this table seems to exist solely to support this
> > feature) changing this seems within the realm of possibility. I have to
> > imagine these workers have a sense of local state that would just be "no
> > errors, no need to touch pg_stat_subscription_workers at the end of this
> > transaction's commit". It would save a local state of the error_xid and if
> > a successfully committed transaction has that xid it would clear the
> > error. The skip code path would also check for and see the matching xid
> > value and clear the error. Even if the local state thing doesn't work, one
> > catalog lookup per transaction seems like potentially reasonable overhead
> > to incur here.
> >
>
> Are you telling to update the catalog to save error_xid when an error
> occurs? If so, that has many challenges like we are not supposed to
> perform any such operations when the transaction is in an error state.
> We have discussed this and other ideas in the beginning. I don't find
> any of your arguments convincing to change the basic approach here but
> I would like to see what others think on this matter?
>
>
Then how does the table get updated to that state in the first place since
it doesn't know the error details until there is an error?

In any case, clearing out the entries in the table would not happen while
it is applying the replication stream, in an error state or otherwise.

in = while streaming
out = not streaming

1(in). replication stream is working
2(in). replication stream fails; capture error information
3(in->out). stop replication stream; perform rollback on xid
4(out). update pg_stat_subscription_workers to report the failure, including
the xid of the transaction
5(out). wait for the user to manually restart the replication stream
[if they do so by skipping the xid, save the xid from
pg_stat_subscription_workers into pg_subscription.subskipxid - possibly
requiring the user to confirm the xid]
[user has now done their thing and requested that the replication stream
resume]
6(out). clear the error information from pg_stat_subscription_workers; it is
no longer useful/doesn't exist because the user just took action to avoid
that very error, one way (skipping its transaction) or another.
7(out->in). resume the replication stream, return to step 1

You are already doing steps 1-5 and 7 today; however, you are forced to deal
with transactions and catalog access. I am just adding step 6, which turns
last_error_xid into current_error_xid, because it is the current value of the
error in the stream during step 5, when the user needs to decide how to
recover from the error. Once the user decides and the stream resumes, that
error information has no value (go look in the logs if you want history).
Thus, when step 7 comes around and the stream is restarted, the error info in
pg_stat_subscription_workers is empty, waiting for the next error to happen.
If the user did nothing in step 5, then when that same WAL is replayed at
step 2 the error will come back.

The main thing is to work out how many ways the user can exit step 5, and to
make sure that, no matter which way they exit, step 6 happens before step 7.
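
To make the flow above concrete, here is a minimal Python sketch of steps 1-7.
It is a toy model under stated assumptions, not PostgreSQL code: Subscription,
skip_xid, current_error_xid, apply_stream and resume are hypothetical names
standing in for pg_subscription.subskipxid, the error row in
pg_stat_subscription_workers, and the apply worker.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Subscription:
    # Hypothetical stand-ins for pg_subscription.subskipxid and the error
    # information in pg_stat_subscription_workers; not the real layout.
    skip_xid: Optional[int] = None
    current_error_xid: Optional[int] = None
    applied: List[int] = field(default_factory=list)


def apply_stream(sub: Subscription,
                 txns: List[Tuple[int, str]],
                 apply_fn: Callable[[str], None]) -> Optional[int]:
    """Steps 1-4: apply transactions until one fails, then record its xid.

    Returns the index of the failed transaction, or None if all applied.
    """
    for i, (xid, payload) in enumerate(txns):
        if xid == sub.skip_xid:
            sub.skip_xid = None       # the requested skip is consumed
            continue
        try:
            apply_fn(payload)         # steps 1-2 (in): streaming
        except Exception:
            # Step 3 (in->out): the stream has stopped, the xid is rolled back.
            # Step 4 (out): only now is the failure recorded.
            sub.current_error_xid = xid
            return i
        sub.applied.append(xid)
    return None


def resume(sub: Subscription, skip: bool) -> None:
    """Steps 5-6: the user has acted; clear the error before streaming again."""
    if skip:
        sub.skip_xid = sub.current_error_xid  # step 5: user chose to skip
    sub.current_error_xid = None              # step 6: cleared on every exit
    # Step 7: the caller restarts apply_stream from the failed position.
```

Every exit from step 5 goes through resume, so the error information is
cleared before the stream restarts. If the user resumes without skipping or
fixing anything, the same transaction fails again and the error is recorded
again.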

David J.
