Re: Failure of subscription tests with topminnow

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Ajin Cherian <itsajin(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Failure of subscription tests with topminnow
Date: 2021-08-26 01:01:56
Message-ID: CAD21AoAxtiP6G788h4nG89Y=E4-WbDU3UbMyT0q8TFGBWnW7uw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
>
> On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > I did a quick check with the following tap test code:
> > >
> > > $node_publisher->poll_query_until('postgres',
> > > qq(
> > > select 1 != foo.column1 from (values(0), (1)) as foo;
> > > ));
> > >
> > > The query returns {t, f} but poll_query_until() never finished. The
> > > same is true when the query returns {f, t}.
> > >
>
> Yes, this is true, I also see the same behaviour.
>
> >
> > This means something different is going on in Ajin's setup. Ajin, can
> > you please share how you confirmed your findings about poll_query?
>
> Relooking at my logs, I think what happens is this:
>
> 1. First walsender 'a' is running.
> 2. Second walsender 'b' starts, attempts to acquire the slot, and finds
> that the slot is active for pid a.
> 3. Now both walsenders are active, the query does not return.
> 4. First walsender 'a' times out and exits.
> 5. Now only the second walsender is active and the query returns OK
> because pid != a.
> 6. Second walsender exits with error.
> 7. Another query attempts to get the pid of the running walsender for
> tap_sub but returns null because both walsenders have exited.
> 8. This null return value results in the next query erroring out and
> the test failing.

So this is slightly different from what we see in the topminnow
logs? According to the server logs, step #5 happened (at 18:44:38.016)
before step #4 (at 18:44:38.043).
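
To make the ordering question concrete, here is a minimal Python sketch of
the polling behaviour discussed above. It assumes that poll_query_until()
compares the whole psql output against the single string 't', and that the
test query has the shape "SELECT pid != <old pid> FROM pg_stat_replication".
The pids and the helper names (poll_output, poll_succeeds) are hypothetical
and exist only for illustration; this is a model, not the test code.

```python
def poll_output(active_pids, old_pid):
    """Emulate unaligned, tuples-only psql output for a query of the
    assumed shape: SELECT pid != old_pid FROM pg_stat_replication.
    One 't' or 'f' per walsender row, joined by newlines."""
    return "\n".join("t" if pid != old_pid else "f" for pid in active_pids)


def poll_succeeds(active_pids, old_pid, expected="t"):
    """Assumed success rule: the complete output must equal `expected`."""
    return poll_output(active_pids, old_pid) == expected


# Hypothetical pids for walsender 'a' (old) and walsender 'b' (new).
A, B = 100, 200

# Walsender sets at the quoted steps, in the order Ajin described.
timeline = [
    ("step 1: only 'a' running", [A]),
    ("step 3: 'a' and 'b' both active", [A, B]),
    ("step 5: 'a' gone, only 'b' active", [B]),
    ("step 7: both walsenders gone", []),
]

for label, pids in timeline:
    print(label, "->", poll_succeeds(pids, A))
```

Under these assumptions the poll can only succeed while 'b' is the sole
row, which is why a step #5 that precedes step #4 in the server logs is
not explained by this model alone.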

>
> >Can you additionally check the value of 'state' from
> >pg_stat_replication for both the old and new walsender sessions?
>
> Yes, will try this and post a patch tomorrow.

Thanks. I guess the state of the new walsender should be "startup"
whereas the old one should be "streaming".
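
As a rough illustration of why the state matters, the sketch below models
a hypothetical variant of the poll that only considers rows whose state is
'streaming'. The row layout, the pids, and the helper name
poll_streaming_only are assumptions made for this example; it is not the
patch under discussion. The state names used are ones that
pg_stat_replication reports.

```python
def poll_streaming_only(rows, old_pid, expected="t"):
    """Hypothetical poll: compare pid against old_pid only for walsenders
    in state 'streaming', then require the whole output to equal 't'."""
    output = "\n".join(
        "t" if row["pid"] != old_pid else "f"
        for row in rows
        if row["state"] == "streaming"
    )
    return output == expected


# Hypothetical pids for the old and new walsender.
OLD, NEW = 100, 200

# New walsender has connected but is still starting up.
both_alive = [
    {"pid": OLD, "state": "streaming"},
    {"pid": NEW, "state": "startup"},
]
# Old walsender has exited; new one has not started streaming yet.
new_not_ready = [{"pid": NEW, "state": "startup"}]
# New walsender is streaming and is the only one left.
new_streaming = [{"pid": NEW, "state": "streaming"}]

print(poll_streaming_only(both_alive, OLD))
print(poll_streaming_only(new_not_ready, OLD))
print(poll_streaming_only(new_streaming, OLD))
```

With this filter a walsender that is still in "startup" never satisfies
the poll on its own, so the wait would end only once the replacement has
actually begun streaming.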

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/