Re: Single transaction in the tablesync worker?

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Single transaction in the tablesync worker?
Date: 2021-02-01 08:40:13
Message-ID: CAA4eK1JdWv84nMyCpTboBURjj70y3BfO1xdy8SYPRqNxtH7TEA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 1, 2021 at 1:08 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> On Mon, Feb 1, 2021 at 5:19 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> > > > AFAIK there is always a potential race with DropSubscription dropping
> > > > > slots. The DropSubscription might be running at exactly the same time
> > > > > the apply worker has just dropped the very same tablesync slot.
> > > > >
> > > >
> > > > We stopped the workers before getting a list of NotReady relations and
> > > > then we try to drop the corresponding slots. So, how such a race
> > > > condition can happen? Note, because we have a lock on pg_subscrition,
> > > > there is no chance that the workers can restart till the transaction
> > > > end.
> > >
> > > OK. I think I was forgetting the logicalrep_worker_stop would also go
> > > into a loop waiting for the worker process to die. So even if the
> > > tablesync worker does simultaneously drop it's own slot, I think it
> > > will certainly at least be in SYNCDONE state before DropSubscription
> > > does anything else with that worker.
> > >
> >
> > How is that ensured? We don't have anything like HOLD_INTERRUPTS
> > between the time dropped the slot and updated rel state as SYNCDONE.
> > So, isn't it possible that after we dropped the slot and before we
> > update the state, the SIGTERM signal arrives and led to worker exit?
> >
>
> The worker has the SIGTERM handler of "die". IIUC the "die" function
> doesn't normally do anything except set some flags to say please die
> at the next convenient opportunity. My understanding is that the
> worker process will not actually exit until it next executes
> CHECK_FOR_INTERRUPTS(), whereupon it will see the ProcDiePending flag
> and *really* die. So even if the SIGTERM signal arrives immediately
> after the slot is dropped, the tablesync will still become SYNCDONE.
> Is this wrong understanding?
>
> But your scenario could still be possible if "die" exited immediately
> (e.g. only in single user mode?).
>

I think it is possible without that as well. There are many calls
in-between those two operations which can internally call
CHECK_FOR_INTERRUPTS. One of the flows where such a possibility exists
is UpdateSubscriptionRelState->SearchSysCacheCopy2->SearchSysCacheCopy->SearchSysCache->SearchCatCache->SearchCatCacheInternal->SearchCatCacheMiss->systable_getnext.
This can internally call heapgetpage where we have
CHECK_FOR_INTERRUPTS. I think even if today there was no CFI call we
can't take a guarantee for the future as the calls used are quite
common. So, probably we need missing_ok flag in DropSubscription.

One more point in the tablesync code you are calling
ReplicationSlotDropAtPubNode with missing_ok as false. What if we get
an error after that and before we have marked the state as SYNCDONE? I
guess it will always error from ReplicationSlotDropAtPubNode after
that because we had already dropped the slot.

--
With Regards,
Amit Kapila.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2021-02-01 08:45:42 Re: Printing backtrace of postgres processes
Previous Message Dmitry Dolgov 2021-02-01 08:39:31 Re: [HACKERS] [PATCH] Generic type subscripting