Re: Skipping logical replication transactions on subscriber side

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Euler Taveira <euler(at)eulerto(dot)com>, "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, Alexey Lesovsky <lesovsky(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Skipping logical replication transactions on subscriber side
Date: 2022-04-01 12:25:52
Message-ID: CAD21AoDF1N98AC91KVQXecLcTQCdUtrHR+YU-xNUStULpFqm1A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Apr 1, 2022 at 5:10 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Fri, Apr 1, 2022 at 4:44 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> >
> > On Tue, Mar 29, 2022 at 10:43:00AM +0530, Amit Kapila wrote:
> > > On Mon, Mar 21, 2022 at 5:51 PM Euler Taveira <euler(at)eulerto(dot)com> wrote:
> > > > On Mon, Mar 21, 2022, at 12:25 AM, Amit Kapila wrote:
> > > > I have fixed all the above comments as per your suggestion in the
> > > > attached. Do let me know if something is missed?
> > > >
> > > > Looks good to me.
> > >
> > > This patch is committed
> > > (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=208c5d65bbd60e33e272964578cb74182ac726a8).
> >
> > src/test/subscription/t/029_on_error.pl has been failing reliably on the five
> > AIX buildfarm members:
> >
> > # poll_query_until timed out executing this query:
> > # SELECT subskiplsn = '0/0' FROM pg_subscription WHERE subname = 'sub'
> > # expecting this output:
> > # t
> > # last actual query output:
> > # f
> > # with stderr:
> > timed out waiting for match: (?^:LOG: done skipping logical replication transaction finished at 0/1D30788) at t/029_on_error.pl line 50.
> >
> > I've posted five sets of logs (2.7 MiB compressed) here:
> > https://drive.google.com/file/d/16NkyNIV07o0o8WM7GwcaAYFQDPTkULkR/view?usp=sharing
>
> Thank you for the report. I'm investigating this issue.

Looking at the subscriber logs, it successfully fetched the correct
error-LSN from the server logs and set it to ALTER SUBSCRIPTION …
SKIP:

2022-03-30 09:48:36.617 UTC [17039636:4] CONTEXT: processing remote
data for replication origin "pg_16391" during "INSERT" for replication
target relation "public.tbl" in transaction 725 finished at 0/1D30788
2022-03-30 09:48:36.617 UTC [17039636:5] LOG: logical replication
subscription "sub" has been disabled due to an error
:
2022-03-30 09:48:36.670 UTC [17039640:1] [unknown] LOG: connection
received: host=[local]
2022-03-30 09:48:36.672 UTC [17039640:2] [unknown] LOG: connection
authorized: user=nm database=postgres application_name=029_on_error.pl
2022-03-30 09:48:36.675 UTC [17039640:3] 029_on_error.pl LOG:
statement: ALTER SUBSCRIPTION sub SKIP (lsn = '0/1D30788')
2022-03-30 09:48:36.676 UTC [17039640:4] 029_on_error.pl LOG:
disconnection: session time: 0:00:00.006 user=nm database=postgres
host=[local]
:
2022-03-30 09:48:36.762 UTC [28246036:2] ERROR: duplicate key value
violates unique constraint "tbl_pkey"
2022-03-30 09:48:36.762 UTC [28246036:3] DETAIL: Key (i)=(1) already exists.
2022-03-30 09:48:36.762 UTC [28246036:4] CONTEXT: processing remote
data for replication origin "pg_16391" during "INSERT" for replication
target relation "public.tbl" in transaction 725 finished at 0/1D30788

However, the worker could not start skipping changes of the error
transaction for some reason. Given that "SELECT subskiplsn = '0/0'
FROM pg_subscription WHERE subname = 'sub’” didn't return true, some
value was set to subskiplsn even after the unique key error.

So I'm guessing that the apply worker could not get the updated value
of the subskiplsn or its MySubscription->skiplsn could not match with
the transaction's finish LSN. Also, given that the test is failing on
all AIX buildfarm members, there might be something specific to AIX.

Noah, to investigate this issue further, is it possible for you to
apply the attached patch and run the 029_on_error.pl test? The patch
adds some logs to get additional information.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachment Content-Type Size
add_logs.patch application/octet-stream 1.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dagfinn Ilmari Mannsåker 2022-04-01 12:33:05 Re: Is monotonous xid8 is a right way to do?
Previous Message Pavel Borisov 2022-04-01 12:13:17 Is monotonous xid8 is a right way to do?