Re: Failure of subscription tests with topminnow

From: Ajin Cherian <itsajin(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Failure of subscription tests with topminnow
Date: 2021-08-24 05:34:40
Message-ID: CAFPTHDbyQ3Tqgjv4Y2v_ifGWu6ywNp8KHtyX3LLtrScLqiW1hg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Aug 16, 2021 at 6:33 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the
> walsender process. Now, here the problem seems to be that the previous
> walsender process (16336) didn't exit and the new process started to
> use the slot which leads to the error shown in the log. This is
> evident from the below part of log where we can see that 16336 has
> exited after new process started to use the slot.
>
> 2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received
> replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
> 0/16BEEB0 (proto_version '1', publication_names
> '"tap_pub","tap_pub_ins_only"')
> 2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT:
> START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
> publication_names '"tap_pub","tap_pub_ins_only"')
> 2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication
> slot "tap_sub" is active for PID 16336
> 2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT:
> START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
> publication_names '"tap_pub","tap_pub_ins_only"')
> 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection:
> session time: 0:00:00.036 user=nm database=postgres host=[local]
> 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection:
> session time: 0:00:06.367 user=nm database=postgres host=[local]
>
> One idea to solve this is to first disable the subscription, wait for
> the walsender process to exit, and then change the connection string
> and re-enable the subscription.

I tried to simulate this by putting delays prior to the exit of the
walsender. Even though I see the same error
in the logs that the replication slot is active for the previous PID,
the test case does not fail in the way seen in this case here..

The new walsender tries to acquire the slot but seeing that there is
another PID that is active on the slot, it
errors and exits. At no point does this new failed walsender update
its pid for the slot. As a result, the below polled query
would not return a true value

$node_publisher->poll_query_until('postgres',
"SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub';"
) or die "Timed out while waiting for apply to restart";

In my runs I see that this query is repeated until a new walsender is
able to successfully acquire the slot.
I am not able to explain why this query returned true in the topminnow
tap test. This would imply that a walsender
was able to acquire the slot and update its pid but I don't see how.
We also know that if this polled query returns
a true (implying a pid other than $oldpid), then the next query in the
test is to try and identify the pid:

SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';

From the topminnow logs we know that this query returned a "NULL", as
the value extracted from this is used
to formulate the next query which errored out and eventually caused
the test case to fail.

[16482:3] 001_rep_changes.pl ERROR: syntax error at or near "FROM"
at character 16
[16482:4] 001_rep_changes.pl STATEMENT: SELECT pid != FROM
pg_stat_replication WHERE application_name = 'tap_sub';

I am not an expert in perl but I looked at the perl function used in
the tap test poll_query_until(), this would poll until the query
returns a 'true' (unless specified otherwise).
I don't see in my tests that the polled function exits if the query
returns a NULL. I don't know if differences in the installed perl can
cause this difference in
behaviour. Can a NULL set be construed as a true and cause the poll to exit?

Any suggestions?

regards,
Ajin Cherian
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2021-08-24 05:35:23 Re: Mark all GUC variable as PGDLLIMPORT
Previous Message Craig Ringer 2021-08-24 05:21:51 Re: Mark all GUC variable as PGDLLIMPORT