Re: Review for GetWALAvailability()

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: alvherre(at)2ndquadrant(dot)com
Cc: masao(dot)fujii(at)oss(dot)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Review for GetWALAvailability()
Date: 2020-06-17 03:10:31
Message-ID: 20200617.121031.1692393794056067357.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote in
> On 2020-Jun-17, Fujii Masao wrote:
> > On 2020/06/17 3:50, Alvaro Herrera wrote:
>
> > So InvalidateObsoleteReplicationSlots() can terminate normal backends.
> > But do we want to do this? If we want, we should add the note about this
> > case into the docs? Otherwise the users would be surprised at termination
> > of backends by max_slot_wal_keep_size. I guess that it's basically rarely
> > happen, though.
>
> Well, if we could distinguish a walsender from a non-walsender process,
> then maybe it would make sense to leave backends alive. But do we want
> that? I admit I don't know what would be the reason to have a
> non-walsender process with an active slot, so I don't have a good
> opinion on what to do in this case.

The non-walsender backend is actually doing replication work. It
rather should be killed?

> > > > + /*
> > > > + * Signal to terminate the process using the replication slot.
> > > > + *
> > > > + * Try to signal every 100ms until it succeeds.
> > > > + */
> > > > + if (!killed && kill(active_pid, SIGTERM) == 0)
> > > > + killed = true;
> > > > + ConditionVariableTimedSleep(&slot->active_cv, 100,
> > > > + WAIT_EVENT_REPLICATION_SLOT_DROP);
> > > > + } while (ReplicationSlotIsActive(slot, NULL));
> > >
> > > Note that here you're signalling only once and then sleeping many times
> > > in increments of 100ms -- you're not signalling every 100ms as the
> > > comment claims -- unless the signal fails, but you don't really expect
> > > that. On the contrary, I'd claim that the logic is reversed: if the
> > > signal fails, *then* you should stop signalling.
> >
> > You mean; in this code path, signaling fails only when the target process
> > disappears just before signaling. So if it fails, slot->active_pid is
> > expected to become 0 even without signaling more. Right?
>
> I guess kill() can also fail if the PID now belongs to a process owned
> by a different user. I think we've disregarded very quick reuse of
> PIDs, so we needn't concern ourselves with it.

The first time call to ConditionVariableTimedSleep doen't actually
sleep, so the loop works as expected. But we may make an extra call
to kill(2). Calling ConditionVariablePrepareToSleep beforehand of the
loop would make it better.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2020-06-17 03:24:29 Re: valgrind versus pg_atomic_init()
Previous Message Alvaro Herrera 2020-06-17 02:40:56 Re: Review for GetWALAvailability()