Re: Failure in subscription test 004_sync.pl

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: Failure in subscription test 004_sync.pl
Date: 2021-06-14 05:11:14
Message-ID: CAD21AoBWqUPeXbZviHWZhdZVMP3vop_k43ipzFwVa=emE7M3Og@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jun 12, 2021 at 9:57 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Sat, Jun 12, 2021 at 1:13 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >
> > wrasse has just failed with what looks like a timing error with a
> > replication slot drop:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-06-12%2006%3A16%3A30
> >
> > Here is the error:
> > error running SQL: 'psql:<stdin>:1: ERROR: could not drop replication
> > slot "tap_sub" on publisher: ERROR: replication slot "tap_sub" is
> > active for PID 1641'
> >
> > It seems to me that this just lacks a poll_query_until() doing some
> > slot monitoring?
> >
>
> I think it is showing a race condition issue in the code. In
> DropSubscription, we first stop the worker that is receiving the WAL,
> and then in a separate connection with the publisher, it tries to drop
> the slot which leads to this error. The reason is that walsender is
> still active as we just wait for wal receiver (or apply worker) to
> stop. Normally, as soon as the apply worker is stopped the walsender
> detects it and exits but in this case, it took some time to exit, and
> in the meantime, we tried to drop the slot which is still in use by
> walsender.

There might be possible.

That's weird since DROP SUBSCRIPTION executes DROP_REPLICATION_SLOT
command with WAIT option. I found a bug that is possibly an oversight
of commit 1632ea4368. The commit changed the code around the error as
follows:

if (active_pid != MyProcPid)
{
- if (behavior == SAB_Error)
+ if (!nowait)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_IN_USE),
errmsg("replication slot \"%s\" is active for PID %d",
NameStr(s->data.name), active_pid)));
- else if (behavior == SAB_Inquire)
- return active_pid;

/* Wait here until we get signaled, and then restart */
ConditionVariableSleep(&s->active_cv,

The condition should be the opposite; we should raise the error when
'nowait' is true. I think this is the cause of the test failure. Even
if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we
don't wait but raise the error.

Attached a small patch fixes it.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

Attachment Content-Type Size
fix_slot_drop.patch application/octet-stream 406 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Drouvot, Bertrand 2021-06-14 05:41:12 Re: Minimal logical decoding on standbys
Previous Message Shinya11.Kato 2021-06-14 04:53:58 RE: [PATCH] expand the units that pg_size_pretty supports on output