Re: Failure of subscription tests with topminnow

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Ajin Cherian <itsajin(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Failure of subscription tests with topminnow
Date: 2021-08-24 11:08:20
Message-ID: CAA4eK1KoycXH_OCO9iKbwr0pakPQ1+5RWq8UZPhw6MPFkU-MLQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 24, 2021 at 11:04 AM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:
>
> On Mon, Aug 16, 2021 at 6:33 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> > The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the
> > walsender process. Now, here the problem seems to be that the previous
> > walsender process (16336) didn't exit and the new process started to
> > use the slot which leads to the error shown in the log. This is
> > evident from the below part of log where we can see that 16336 has
> > exited after new process started to use the slot.
> >
> > 2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received
> > replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
> > 0/16BEEB0 (proto_version '1', publication_names
> > '"tap_pub","tap_pub_ins_only"')
> > 2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT:
> > START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
> > publication_names '"tap_pub","tap_pub_ins_only"')
> > 2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication
> > slot "tap_sub" is active for PID 16336
> > 2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT:
> > START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
> > publication_names '"tap_pub","tap_pub_ins_only"')
> > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
> > session time: 0:00:00.036 user=nm database=postgres host=[local]
> > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
> > session time: 0:00:06.367 user=nm database=postgres host=[local]
> >
> > One idea to solve this is to first disable the subscription, wait for
> > the walsender process to exit, and then change the connection string
> > and re-enable the subscription.
>
> I tried to simulate this by putting delays prior to the exit of the
> walsender. Even though I see the same error
> in the logs that the replication slot is active for the previous PID,
> the test case does not fail in the way seen in this case here..
>
> The new walsender tries to acquire the slot but seeing that there is
> another PID that is active on the slot, it
> errors and exits. At no point does this new failed walsender update
> its pid for the slot. As a result, the below polled query
> would not return a true value
>
> $node_publisher->poll_query_until('postgres',
> "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = 'tap_sub';"
> ) or die "Timed out while waiting for apply to restart";
>
> In my runs I see that this query is repeated until a new walsender is
> able to successfully acquire the slot.
>

What happens when there is neither a new walsender nor an old
walsender is present? It means to run the above statement when a new
walsender is exited due to error "... slot is active ..." and before a
new walsender could start. Also, allow old walsender (due to which the
error occurs) to exit before executing the statement.

Also, it seems this failure happens on REL_11_STABLE branch, not sure
if that matters, but it is better to use the same branch if you are
using a different branch to reproduce the issue.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ranier Vilela 2021-08-24 11:20:16 Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN
Previous Message Laurenz Albe 2021-08-24 10:34:25 Re: pgstat_send_connstats() introduces unnecessary timestamp and UDP overhead