Re: Race condition in InvalidateObsoleteReplicationSlots()

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Fujii Masao <fujii(at)postgresql(dot)org>
Subject: Re: Race condition in InvalidateObsoleteReplicationSlots()
Date: 2021-04-08 11:33:41
Message-ID: CAA4eK1+m+1RERFzcObEsyhORgF3TcA=cXj5WBR_33nDb8YQucg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 8, 2021 at 7:39 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2021-04-07 17:10:37 -0700, Andres Freund wrote:
> > I think this can be solved in two different ways:
> >
> > 1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of
> > InvalidateObsoleteReplicationSlots(). That way nobody could re-create a new
> > slot in the to-be-obsoleted-slot's place.
> >
> > 2) Atomically check whether the slot needs to be invalidated and try to
> > acquire if needed. Don't release ReplicationSlotControlLock between those
> > two steps. Signal the owner to release the slot iff we couldn't acquire the
> > slot. In the latter case wait and then recheck if the slot still needs to
> > be dropped.
> >
> > To me 2) seems better, because we then can also be sure that the slot still
> > needs to be obsoleted, rather than potentially doing so unnecessarily.
> >

+1.

> >
> > It looks to me like several of the problems here stem from trying to reuse
> > code from ReplicationSlotAcquireInternal() (which before this was just named
> > ReplicationSlotAcquire()). I don't think that makes sense, because cases like
> > this want to check if a condition is true, and acquire it only if so.
> >
> > IOW, I think this basically needs to look like ReplicationSlotsDropDBSlots(),
> > except that a different condition is checked, and the if (active_pid) case
> > needs to prepare a condition variable, signal the owner and then wait on the
> > condition variable, to restart after.
>
> I'm also confused by the use of ConditionVariableTimedSleep(timeout =
> 10). Why do we need a timed sleep here in the first place? And why with
> such a short sleep?
>
> I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS(); -
> but is aware it's running in checkpointer. I don't think CFI does much
> there? If we are worried about needing to check for interrupts, more
> work is needed.
>
>
> Sketch for a fix attached. I did leave the odd
> ConditionVariableTimedSleep(10ms) in, because I wasn't sure why it's
> there...
>

I haven't tested the patch but I couldn't spot any problems while
reading it. A minor point, don't we need to use
ConditionVariableCancelSleep() at some point after doing
ConditionVariableTimedSleep?

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2021-04-08 11:36:48 Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
Previous Message Bharath Rupireddy 2021-04-08 11:25:22 Simplify backend terminate and wait logic in postgres_fdw test