Re: subscriptionCheck failures

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: subscriptionCheck failures
Date: 2021-03-16 06:59:34
Message-ID: CAA4eK1JnDQF_B9vDj7WQz9uufbMfE5wo=mdG5fbHq9S7Unfhtg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 16, 2021 at 9:00 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Mon, Mar 15, 2021 at 6:00 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> >
> > Hi,
> >
> > This seems to be a new low frequency failure, I didn't see it mentioned already:
> >
>
> Thanks for reporting, I'll look into it.
>

By looking at the logs [1] in the buildfarm, I think I know what is
going on here. After Create Subscription, the tablesync worker is
launched and tries to create the slot for doing the initial copy but
before it could finish creating the slot, we issued the Drop
Subscription which first stops the tablesync worker and then tried to
drop its slot. Now, it is quite possible that by the time Drop
Subscription tries to drop the tablesync slot, it is not yet created.
We treat this condition okay and just Logs the message. I don't think
this is an issue because anyway generally such a slot created on the
server will be dropped before we persist it but the test was checking
the existence of slots on server before it gets dropped. I think we
can avoid such a situation by preventing cancel/die interrupts while
creating tablesync slot.

This is a timing issue, so I have reproduced it via debugger and
tested that the attached patch fixes it.

[1]:
2021-02-23 09:57:47.593 UTC [6034d19b.291aed:7] LOG: received
replication command: CREATE_REPLICATION_SLOT
"pg_16396_sync_16384_6932396177428838370" LOGICAL pgoutput
USE_SNAPSHOT
2021-02-23 09:57:47.593 UTC [6034d19b.291aed:8] STATEMENT:
CREATE_REPLICATION_SLOT "pg_16396_sync_16384_6932396177428838370"
LOGICAL pgoutput USE_SNAPSHOT
2021-02-23 09:57:47.664 UTC [6034d19b.291ae2:14] LOG: disconnection:
session time: 0:00:00.130 user=andres database=postgres host=[local]
2021-02-23 09:57:47.686 UTC [6034d19b.291af3:1] LOG: connection
received: host=[local]
2021-02-23 09:57:47.687 UTC [6034d19b.291af3:2] LOG: replication
connection authorized: user=andres application_name=tap_sub
2021-02-23 09:57:47.688 UTC [6034d19b.291af3:3] LOG: statement:
SELECT pg_catalog.set_config('search_path', '', false);
2021-02-23 09:57:47.688 UTC [6034d19b.291af3:4] LOG: received
replication command: DROP_REPLICATION_SLOT
pg_16396_sync_16384_6932396177428838370 WAIT
2021-02-23 09:57:47.688 UTC [6034d19b.291af3:5] STATEMENT:
DROP_REPLICATION_SLOT pg_16396_sync_16384_6932396177428838370 WAIT
2021-02-23 09:57:47.688 UTC [6034d19b.291af3:6] ERROR: replication
slot "pg_16396_sync_16384_6932396177428838370" does not exist
2021-02-23 09:57:47.688 UTC [6034d19b.291af3:7] STATEMENT:
DROP_REPLICATION_SLOT pg_16396_sync_16384_6932396177428838370 WAIT

--
With Regards,
Amit Kapila.

Attachment Content-Type Size
v1-0001-Prevent-cancel-die-interrupts-while-creating-tabl.patch application/octet-stream 1.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2021-03-16 07:15:15 Re: subscriptionCheck failures
Previous Message Fujii Masao 2021-03-16 06:42:27 Re: Type of wait events WalReceiverWaitStart and WalSenderWaitForWAL