Quick Links

Re: Failure of subscription tests with topminnow

From:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To:	Ajin Cherian <itsajin(at)gmail(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Failure of subscription tests with topminnow
Date:	2021-08-26 04:45:21
Message-ID:	CAD21AoA1tX=CgMPXVtQ9L6uA1cg44j85=Y1GhNdUFchJ=bzpqA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Aug 26, 2021 at 12:59 PM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
>
> On Thu, Aug 26, 2021 at 1:54 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > >
> > > > You have a point but if we see the below logs, it seems the second
> > > > walsender (#step6) seemed to exited before the first walsender
> > > > (#step4).
> > > >
> > > > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
> > > > session time: 0:00:00.036 user=nm database=postgres host=[local]
> > > > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
> > > > session time: 0:00:06.367 user=nm database=postgres host=[local]
> > > >
> > > > Isn't it possible that pid is cleared in the other order due to which
> > > > we are seeing this problem?
> > >
> > > If the pid is cleared in the other order, wouldn't the query [1] return a false?
> > >
> > > [1] - " SELECT pid != 16336 FROM pg_stat_replication WHERE
> > > application_name = 'tap_sub';"
> > >
> >
> > I think it should return true because pid for 16336 is cleared first
> > and the remaining one will be 16475.
>
> Yes, that was what I explained as well. 16336 is PID 'a' (first
> walsender) in my explanation. The first walsender should
> be cleared first for this theory to work.

I think that it’s possible that the orders of the process writing
disconnections logs and setting 0 to walsender's pid are mismatched.
We set 0 to walsender's pid in WalSndKill() that is called during
on_shmem_exit callback. Once we set 0, pg_stat_replication doesn't
show the entry. On the other hand, disconnections logs are written by
log_disconnections() that is called during on_proc_exit callback. That
is, the following sequence could happen:

1. the second walsender (pid = 16475) raises an error as the slot is
already active (held by the first walsender).
2. the first walsender (pid = 16336) clears its pid on the shmem.
3. the polling query checks the walsender’s pid, and returns true
(since there is only the second walsender now).
4. the second walsender clears its pid on the shmem.
5. the second walsender writes disconnection log.
6. the first walsender writes disconneciton log.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Re: Failure of subscription tests with topminnow at 2021-08-26 03:59:22 from Ajin Cherian

Responses

Re: Failure of subscription tests with topminnow at 2021-08-26 10:58:40 from Ajin Cherian

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Amit Kapila	2021-08-26 04:59:49	Re: row filtering for logical replication
Previous Message	Peter Smith	2021-08-26 04:21:31	Re: row filtering for logical replication