Re: Skipping logical replication transactions on subscriber side

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, Alexey Lesovsky <lesovsky(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Skipping logical replication transactions on subscriber side
Date: 2021-11-30 16:44:06
Message-ID: CALDaNm2i8z-Uxs7gb00ZqzW0xfKSwq=Qy3QJRnBND0pOPwBhCQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 30, 2021 at 7:09 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Tue, Nov 30, 2021 at 8:41 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Tue, Nov 30, 2021 at 6:28 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Nov 29, 2021 at 11:38 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > > >
> > >
> > > I have pushed this patch and there is a buildfarm failure for it. See:
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2021-11-30%2005%3A05%3A25
> > >
> > > Sawada-San has shared his initial analysis on pgsql-committers [1] and
> > > I am responding here as the fix requires some more discussion.
> > >
> > > > Looking at the result the test actually got, we had two error entries
> > > > for test_tab1 instead of one:
> > > >
> > > > # Failed test 'check the error reported by the apply worker'
> > > > # at t/026_worker_stats.pl line 33.
> > > > # got: 'tap_sub|INSERT|test_tab1|t
> > > > # tap_sub||test_tab1|t'
> > > > # expected: 'tap_sub|INSERT|test_tab1|t'
> > > >
> > > > The possible scenarios are:
> > > >
> > > > The table sync worker for test_tab1 failed due to an error unrelated
> > > > to apply changes:
> > > >
> > > > 2021-11-30 06:24:02.137 CET [18990:2] ERROR: replication origin with
> > > > OID 2 is already active for PID 23706
> > > >
> > > > At this time, the view had one error entry for the table sync worker.
> > > > After retrying table sync, it succeeded:
> > > >
> > > > 2021-11-30 06:24:04.202 CET [28117:2] LOG: logical replication table
> > > > synchronization worker for subscription "tap_sub", table "test_tab1"
> > > > has finished
> > > >
> > > > Then after inserting a row on the publisher, the apply worker inserted
> > > > the row but failed due to violating a unique key violation, which is
> > > > expected:
> > > >
> > > > 2021-11-30 06:24:04.307 CET [4806:2] ERROR: duplicate key value
> > > > violates unique constraint "test_tab1_pkey"
> > > > 2021-11-30 06:24:04.307 CET [4806:3] DETAIL: Key (a)=(1) already exists.
> > > > 2021-11-30 06:24:04.307 CET [4806:4] CONTEXT: processing remote data
> > > > during "INSERT" for replication target relation "public.test_tab1" in
> > > > transaction 721 at 2021-11-30 06:24:04.305096+01
> > > >
> > > > As a result, we had two error entries for test_tab1: the table sync
> > > > worker error and the apply worker error. I didn't expect that the
> > > > table sync worker for test_tab1 failed due to "replication origin with
> > > > OID 2 is already active for PID 23706” error.
> > > >
> > > > Looking at test_subscription_error() in 026_worker_stats.pl, we have
> > > > two checks; in the first check, we wait for the view to show the error
> > > > entry for the given relation name and xid. This check was passed since
> > > > we had the second error (i.g., apply worker error). In the second
> > > > check, we get error entries from pg_stat_subscription_workers by
> > > > specifying only the relation name. Therefore, we ended up getting two
> > > > entries and failed the tests.
> > > >
> > > > To fix this issue, I think that in the second check, we can get the
> > > > error from pg_stat_subscription_workers by specifying the relation
> > > > name *and* xid like the first check does. I've attached the patch.
> > > > What do you think?
> > > >
> > >
> > > I think this will fix the reported failure but there is another race
> > > condition in the test. Isn't it possible that for table test_tab2, we
> > > get an error "replication origin with OID ..." or some other error
> > > before copy, in that case also, we will proceed from the second call
> > > of test_subscription_error() which is not what we expect in the test?
> >
> > Right.
> >
> > > Shouldn't we someway check that the error message also starts with
> > > "duplicate key value violates ..."?
> >
> > Yeah, I think it's a good idea to make the checks more specific. That
> > is, probably we can specify the prefix of the error message and
> > subrelid in addition to the current conditions: relid and xid. That
> > way, we can check what error was reported by which workers (tablesync
> > or apply) for which relations. And both check queries in
> > test_subscription_error() can have the same WHERE clause.
>
> I've attached a patch that fixes this issue. Please review it.

Thanks for the updated patch, the patch applies neatly and make
check-world passes. Also I ran the failing test in a loop and found it
to be passing always.

Regards,
Vignesh

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Laurenz Albe 2021-11-30 16:47:18 Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
Previous Message vignesh C 2021-11-30 16:27:04 Re: row filtering for logical replication