Re: Review for GetWALAvailability()

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: alvherre(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Review for GetWALAvailability()
Date: 2020-06-17 11:13:01
Message-ID: f898aa30-053e-3598-f1f1-4b3b431f8f30@oss.nttdata.com
Lists: pgsql-hackers

On 2020/06/17 17:30, Kyotaro Horiguchi wrote:
> At Wed, 17 Jun 2020 17:01:11 +0900, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote in
>>
>>
>> On 2020/06/17 12:10, Kyotaro Horiguchi wrote:
>>> At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera
>>> <alvherre(at)2ndquadrant(dot)com> wrote in
>>>> On 2020-Jun-17, Fujii Masao wrote:
>>>>> On 2020/06/17 3:50, Alvaro Herrera wrote:
>>>>
>>>>> So InvalidateObsoleteReplicationSlots() can terminate normal backends.
>>>>> But do we want to do this? If so, we should add a note about this
>>>>> case to the docs. Otherwise users would be surprised when backends
>>>>> are terminated because of max_slot_wal_keep_size. I guess that this
>>>>> rarely happens, though.
>>>>
>>>> Well, if we could distinguish a walsender from a non-walsender process,
>>>> then maybe it would make sense to leave backends alive. But do we want
>>>> that? I admit I don't know what would be the reason to have a
>>>> non-walsender process with an active slot, so I don't have a good
>>>> opinion on what to do in this case.
>>> The non-walsender backend is actually doing replication work.
>>> Shouldn't it rather be killed?
>>
>> I have no better opinion about this. So I agree to leave the logic as it is,
>> at least for now, i.e., we terminate the process owning the slot whatever
>> the type of process is.
>
> Agreed.
>
>>>>>>> +        /*
>>>>>>> +         * Signal to terminate the process using the replication slot.
>>>>>>> +         *
>>>>>>> +         * Try to signal every 100ms until it succeeds.
>>>>>>> +         */
>>>>>>> +        if (!killed && kill(active_pid, SIGTERM) == 0)
>>>>>>> +            killed = true;
>>>>>>> +        ConditionVariableTimedSleep(&slot->active_cv, 100,
>>>>>>> +                                    WAIT_EVENT_REPLICATION_SLOT_DROP);
>>>>>>> +    } while (ReplicationSlotIsActive(slot, NULL));
>>>>>>
>>>>>> Note that here you're signalling only once and then sleeping many times
>>>>>> in increments of 100ms -- you're not signalling every 100ms as the
>>>>>> comment claims -- unless the signal fails, but you don't really expect
>>>>>> that. On the contrary, I'd claim that the logic is reversed: if the
>>>>>> signal fails, *then* you should stop signalling.
>>>>>
>>>>> You mean that, in this code path, signaling fails only when the target
>>>>> process disappears just before signaling. So if it fails,
>>>>> slot->active_pid is expected to become 0 even without signaling
>>>>> again. Right?
>>>>
>>>> I guess kill() can also fail if the PID now belongs to a process owned
>>>> by a different user.
>>
>> Yes. This case means that the PostgreSQL process using the slot disappeared
>> and the same PID was assigned to a non-PostgreSQL process. So if kill() fails
>> for this reason, we don't need to kill() again.
>>
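
To make the two failure cases above concrete, here is a minimal standalone
sketch using only POSIX kill(2). It is not taken from the patch: the enum, the
helper names send_signal_once() and owner_already_gone(), and the DEMO_MAIN
guard are all hypothetical and exist only for illustration. The assumption is
that ESRCH means the target has already exited and EPERM means the PID now
belongs to a process we are not allowed to signal.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical classification of one kill(2) attempt; not PostgreSQL code. */
typedef enum SignalOutcome
{
    SIGNAL_OK,              /* kill() returned 0 */
    SIGNAL_NO_SUCH_PROCESS, /* ESRCH: the target process no longer exists */
    SIGNAL_NOT_PERMITTED,   /* EPERM: PID belongs to a process we may not signal */
    SIGNAL_OTHER_FAILURE    /* anything else, e.g. EINVAL for a bad signal number */
} SignalOutcome;

/* Send the signal exactly once and report what happened. */
static SignalOutcome
send_signal_once(pid_t pid, int signo)
{
    if (kill(pid, signo) == 0)
        return SIGNAL_OK;

    switch (errno)
    {
        case ESRCH:
            return SIGNAL_NO_SUCH_PROCESS;
        case EPERM:
            return SIGNAL_NOT_PERMITTED;
        default:
            return SIGNAL_OTHER_FAILURE;
    }
}

/*
 * In both of these cases the process that owned the slot is gone, so
 * signalling again would not help.
 */
static bool
owner_already_gone(SignalOutcome outcome)
{
    return outcome == SIGNAL_NO_SUCH_PROCESS ||
           outcome == SIGNAL_NOT_PERMITTED;
}

#ifdef DEMO_MAIN
int
main(void)
{
    /* Signal 0 only performs the existence and permission checks. */
    SignalOutcome outcome = send_signal_once(getpid(), 0);

    printf("outcome=%d owner_already_gone=%d\n",
           (int) outcome, (int) owner_already_gone(outcome));
    return 0;
}
#endif
```

Compiling with -DDEMO_MAIN gives a small program that probes its own PID with
signal 0, which sends nothing and only runs the checks. A caller following the
logic discussed here would signal once and then, on ESRCH or EPERM, simply wait
for the slot to be released rather than signalling again.
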
>>>> I think we've disregarded very quick reuse of
>>>> PIDs, so we needn't concern ourselves with it.
>>>
>>> The first call to ConditionVariableTimedSleep doesn't actually
>>> sleep, so the loop works as expected. But we may make an extra call
>>> to kill(2). Calling ConditionVariablePrepareToSleep before the
>>> loop would make it better.
>>
>> Sorry I failed to understand your point...
>
> My point is that ConditionVariableTimedSleep does *not* sleep on the CV
> the first time in this usage. The new version avoids the useless
> kill(2) call anyway, but it may still make an extra call to
> ReplicationSlotAcquireInternal. I think we should call
> ConditionVariablePrepareToSleep before the surrounding for statement
> block.

OK, so what about the attached patch? I added ConditionVariablePrepareToSleep()
just before entering the "for" loop in InvalidateObsoleteReplicationSlots().
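
To illustrate why preparing before the loop matters, here is a toy model in
plain C. It is not the patch and uses no PostgreSQL code: ToyCV,
toy_prepare_to_sleep(), toy_timed_sleep(), count_acquire_attempts() and the
TOY_DEMO_MAIN guard are all made up for this sketch. It models only the one
behavior discussed above, namely that a timed sleep on a condition variable
that has not been prepared does the preparation and returns without sleeping,
so the loop goes around one extra time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for a condition variable; not PostgreSQL's ConditionVariable. */
typedef struct ToyCV
{
    bool prepared; /* has the caller registered itself as a waiter? */
    int  sleeps;   /* number of calls that really "slept" */
} ToyCV;

static void
toy_prepare_to_sleep(ToyCV *cv)
{
    cv->prepared = true;
}

/*
 * If the CV was not prepared, only prepare and return immediately.
 * Otherwise count one real sleep.
 */
static void
toy_timed_sleep(ToyCV *cv)
{
    if (!cv->prepared)
    {
        cv->prepared = true;
        return;
    }
    cv->sleeps++;
}

/*
 * Hypothetical retry loop. The "slot" is assumed to become free once the
 * waiter has really slept sleeps_needed times. Returns how many times the
 * loop body attempted to acquire the slot.
 */
static int
count_acquire_attempts(bool prepare_first, int sleeps_needed)
{
    ToyCV cv = {false, 0};
    int   attempts = 0;

    if (prepare_first)
        toy_prepare_to_sleep(&cv);

    for (;;)
    {
        attempts++;                     /* stands in for the acquire attempt */
        if (cv.sleeps >= sleeps_needed) /* slot released, we are done */
            break;
        toy_timed_sleep(&cv);
    }
    return attempts;
}

#ifdef TOY_DEMO_MAIN
int
main(void)
{
    /* Prints "without prepare: 3, with prepare: 2". */
    printf("without prepare: %d, with prepare: %d\n",
           count_acquire_attempts(false, 1),
           count_acquire_attempts(true, 1));
    return 0;
}
#endif
```

In this model the unprepared loop always makes exactly one more acquire
attempt than the prepared one, which is the extra call mentioned above.
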

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment Content-Type Size
invalidate_obsolete_replication_slots_v3.patch text/plain 9.0 KB
