Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Alexey Vasiliev <leopard_ne(at)inbox(dot)ru>, Fabrízio Mello <fabriziomello(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Date: 2015-02-23 11:57:29
Message-ID: CAHGQGwE6U+jyjm6qutmy90-mVJ+GiaZY9KQynEXJnAH9h2gt+Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Feb 20, 2015 at 9:33 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Fri, Feb 20, 2015 at 9:12 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Tue, Feb 10, 2015 at 10:32 PM, Michael Paquier
>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>> On Mon, Feb 9, 2015 at 8:29 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>> On Sun, Feb 8, 2015 at 2:54 PM, Michael Paquier
>>>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>>>> On Fri, Feb 6, 2015 at 4:58 PM, Fujii Masao wrote:
>>>>>> - * Wait for more WAL to arrive. Time out after 5 seconds,
>>>>>> - * like when polling the archive, to react to a trigger
>>>>>> - * file promptly.
>>>>>> + * Wait for more WAL to arrive. Time out after
>>>>>> the amount of
>>>>>> + * time specified by wal_retrieve_retry_interval, like
>>>>>> + * when polling the archive, to react to a
>>>>>> trigger file promptly.
>>>>>> */
>>>>>> WaitLatch(&XLogCtl->recoveryWakeupLatch,
>>>>>> WL_LATCH_SET | WL_TIMEOUT,
>>>>>> - 5000L);
>>>>>> + wal_retrieve_retry_interval * 1000L);
>>>>>>
>>>>>> This change can prevent the startup process from reacting to
>>>>>> a trigger file. Imagine the case where the large interval is set
>>>>>> and the user want to promote the standby by using the trigger file
>>>>>> instead of pg_ctl promote. I think that the sleep time should be 5s
>>>>>> if the interval is set to more than 5s. Thought?
>>>>>
>>>>> I disagree here. It is interesting to accelerate the check of WAL
>>>>> availability from a source in some cases for replication, but the
>>>>> opposite is true as well as mentioned by Alexey at the beginning of
>>>>> the thread to reduce the number of requests when requesting WAL
>>>>> archives from an external storage type AWS. Hence a correct solution
>>>>> would be to check periodically for the trigger file with a maximum
>>>>> one-time wait of 5s to ensure backward-compatible behavior. We could
>>>>> reduce it to 1s or something like that as well.
>>>>
>>>> You seem to have misunderstood the code in question. Or I'm missing something.
>>>> The timeout of the WaitLatch is just the interval to check for the trigger file
>>>> while waiting for more WAL to arrive from streaming replication. Not related to
>>>> the retry time to restore WAL from the archive.
>>>
>>> [Re-reading the code...]
>>> Aah.. Yes you are right. Sorry for the noise. Yes let's wait for a
>>> maximum of 5s then. I also noticed in previous patch that the wait was
>>> maximized to 5s. To begin with, a loop should have been used if it was
>>> a sleep, but as now patch uses a latch this limit does not make much
>>> sense... Patch updated is attached.
>>
>> On second thought, the interval to check the trigger file is very different
>> from the wait time to retry to retrieve WAL. So it seems strange and even
>> confusing to control them by one parameter. If we really want to make
>> the interval for the trigger file configurable, we should invent new GUC for it.
>> But I don't think that it's worth doing that. If someone wants to react the
>> trigger file more promptly for "fast" promotion, he or she basically can use
>> pg_ctl promote command, instead. Thought?
>
> Hm, OK.
>
>> Attached is the updated version of the patch. I changed the parameter so that
>> it doesn't affect the interval of checking the trigger file.
>>
>> - static pg_time_t last_fail_time = 0;
>> - pg_time_t now;
>> + TimestampTz now = GetCurrentTimestamp();
>> + TimestampTz last_fail_time = now;
>>
>> I reverted the code here as it was. I don't think GetCurrentTimestamp() needs
>> to be called for each WaitForWALToBecomeAvailable().
>>
>> + WaitLatch(&XLogCtl->recoveryWakeupLatch,
>> + WL_LATCH_SET | WL_TIMEOUT,
>> + wait_time / 1000);
>>
>> Don't we need to specify WL_POSTMASTER_DEATH flag here? Added.
>
> Yeah, I am wondering though why this has not been added after 89fd72cb though.
>
>> + {"wal_retrieve_retry_interval", PGC_SIGHUP, WAL_SETTINGS,
>>
>> WAL_SETTINGS should be REPLICATION_STANDBY? Changed.
>
> Sure.

So I pushed the patch.

Regards,

--
Fujii Masao

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2015-02-23 12:22:02 Re: [REVIEW] Re: Compression of full-page-writes
Previous Message Andres Freund 2015-02-23 11:01:16 Re: Redesigning checkpoint_segments