Re: Failure of subscription tests with topminnow

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Ajin Cherian <itsajin(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Failure of subscription tests with topminnow
Date: 2021-08-26 04:45:21
Message-ID: CAD21AoA1tX=CgMPXVtQ9L6uA1cg44j85=Y1GhNdUFchJ=bzpqA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Aug 26, 2021 at 12:59 PM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
>
> On Thu, Aug 26, 2021 at 1:54 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > >
> > > > You have a point but if we see the below logs, it seems the second
> > > > walsender (#step6) seemed to exited before the first walsender
> > > > (#step4).
> > > >
> > > > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
> > > > session time: 0:00:00.036 user=nm database=postgres host=[local]
> > > > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
> > > > session time: 0:00:06.367 user=nm database=postgres host=[local]
> > > >
> > > > Isn't it possible that pid is cleared in the other order due to which
> > > > we are seeing this problem?
> > >
> > > If the pid is cleared in the other order, wouldn't the query [1] return a false?
> > >
> > > [1] - " SELECT pid != 16336 FROM pg_stat_replication WHERE
> > > application_name = 'tap_sub';"
> > >
> >
> > I think it should return true because pid for 16336 is cleared first
> > and the remaining one will be 16475.
>
> Yes, that was what I explained as well. 16336 is PID 'a' (first
> walsender) in my explanation. The first walsender should
> be cleared first for this theory to work.

I think that it’s possible that the orders of the process writing
disconnections logs and setting 0 to walsender's pid are mismatched.
We set 0 to walsender's pid in WalSndKill() that is called during
on_shmem_exit callback. Once we set 0, pg_stat_replication doesn't
show the entry. On the other hand, disconnections logs are written by
log_disconnections() that is called during on_proc_exit callback. That
is, the following sequence could happen:

1. the second walsender (pid = 16475) raises an error as the slot is
already active (held by the first walsender).
2. the first walsender (pid = 16336) clears its pid on the shmem.
3. the polling query checks the walsender’s pid, and returns true
(since there is only the second walsender now).
4. the second walsender clears its pid on the shmem.
5. the second walsender writes disconnection log.
6. the first walsender writes disconneciton log.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2021-08-26 04:59:49 Re: row filtering for logical replication
Previous Message Peter Smith 2021-08-26 04:21:31 Re: row filtering for logical replication