RE: subscriptionCheck failures

From: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>
To: 'vignesh C' <vignesh21(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: subscriptionCheck failures
Date: 2021-03-16 12:52:23
Message-ID: OSBPR01MB48885DFBB4B098909675D357ED6B9@OSBPR01MB4888.jpnprd01.prod.outlook.com

Hi

On Tuesday, March 16, 2021 4:15 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> On Tue, Mar 16, 2021 at 12:29 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> >
> > On Tue, Mar 16, 2021 at 9:00 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> > >
> > > On Mon, Mar 15, 2021 at 6:00 PM Thomas Munro
> <thomas(dot)munro(at)gmail(dot)com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > This seems to be a new low frequency failure, I didn't see it mentioned
> already:
> > > >
> > >
> > > Thanks for reporting, I'll look into it.
> > >
> >
> > By looking at the logs [1] in the buildfarm, I think I know what is
> > going on here. After Create Subscription, the tablesync worker is
> > launched and tries to create the slot for doing the initial copy but
> > before it could finish creating the slot, we issued the Drop
> > Subscription which first stops the tablesync worker and then tried to
> > drop its slot. Now, it is quite possible that by the time Drop
> > Subscription tries to drop the tablesync slot, it is not yet created.
> > We treat this condition okay and just Logs the message. I don't think
> > this is an issue because anyway generally such a slot created on the
> > server will be dropped before we persist it but the test was checking
> > the existence of slots on server before it gets dropped. I think we
> > can avoid such a situation by preventing cancel/die interrupts while
> > creating tablesync slot.
> >
> > This is a timing issue, so I have reproduced it via debugger and
> > tested that the attached patch fixes it.
> >
>
> Thanks for the patch.
> I was able to reproduce the issue using debugger by making it wait at
> CreateReplicationSlot. After applying the patch the issue gets solved.
I really appreciate everyone's help.

To double-check, I also tested the patch with a debugger.
In addition, I put a while loop at the top of CreateReplicationSlot to control the walsender.
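
A hold loop of this kind might look like the sketch below. This is a hypothetical, standalone illustration, not the code that was actually placed in CreateReplicationSlot: the names hold_walsender and wait_for_release and the max_polls bound are assumptions introduced here so that the example terminates on its own. In a real debugging session the flag would be cleared from the debugger.

```c
#include <assert.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical flag; volatile so a debugger can change it while the loop runs. */
static volatile bool hold_walsender = true;

/*
 * Poll until hold_walsender is cleared or max_polls polls have been made.
 * Returns the number of polls performed.
 */
static int
wait_for_release(int max_polls)
{
    int polls = 0;

    assert(max_polls >= 0);
    while (hold_walsender && polls < max_polls)
    {
        sleep(1);   /* wait one second between checks */
        polls++;
    }
    return polls;
}
```

With a debugger attached, setting hold_walsender to false lets the process leave the loop and continue.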

Without the patch, DROP SUBSCRIPTION proceeds
even while the table sync worker is blocked by the CreateReplicationSlot loop,
and the table sync worker is killed by the DROP SUBSCRIPTION command.
With the patch, on the other hand, I observed that
DROP SUBSCRIPTION hangs and waits
until I release the walsender process from CreateReplicationSlot.
After that, the command drops two slots, as shown below.

NOTICE: dropped replication slot "pg_16391_sync_16385_6940222843739406079" on publisher
NOTICE: dropped replication slot "mysub1" on publisher
DROP SUBSCRIPTION
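
The difference between the two runs can be modeled with a small standalone sketch. It assumes a hold-off counter in the spirit of PostgreSQL's HOLD_INTERRUPTS()/RESUME_INTERRUPTS() macros; every name below (run_worker, check_for_interrupts, and the state variables) is hypothetical, and it is a simplified simulation rather than the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

static int  holdoff_count;   /* > 0 means cancel/die requests are deferred */
static bool die_pending;     /* a terminate request has been received */
static bool worker_alive;
static bool slot_created;

/* Act on a pending terminate request unless interrupts are being held. */
static void
check_for_interrupts(void)
{
    if (die_pending && holdoff_count == 0)
        worker_alive = false;
}

/*
 * Simulate a sync worker that receives a terminate request in the middle
 * of slot creation.  Returns true if the slot exists when the worker exits.
 */
static bool
run_worker(bool hold_interrupts)
{
    holdoff_count = 0;
    die_pending = false;
    worker_alive = true;
    slot_created = false;

    if (hold_interrupts)
        holdoff_count++;         /* start deferring cancel/die */

    die_pending = true;          /* the request arrives mid-creation */
    check_for_interrupts();
    if (worker_alive)
        slot_created = true;     /* creation runs to completion */

    if (hold_interrupts)
        holdoff_count--;         /* stop deferring */
    assert(holdoff_count == 0);
    check_for_interrupts();      /* a deferred request is honored here */

    return slot_created;
}
```

In this model, run_worker(false) exits before the slot exists, while run_worker(true) finishes creating the slot before it honors the terminate request, so the later drop has a slot to find.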

To me, this looks correct, because
the point at which the while loop stops the walsender
means that DROP SUBSCRIPTION affects two slots. Any comments?

Best Regards,
Takamichi Osumi
