Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Alexey Vasiliev <leopard_ne(at)inbox(dot)ru>, Fabrízio Mello <fabriziomello(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Date: 2015-01-20 03:57:37
Message-ID: CAB7nPqQ4iZzjDrJc5VBvHY6uG7wGe9A8pjPSEfj3H7_NfNiu9Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 20, 2015 at 1:56 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2015-01-19 17:16:11 +0900, Michael Paquier wrote:
>> On Mon, Jan 19, 2015 at 4:10 PM, Michael Paquier
>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> > On Sat, Jan 17, 2015 at 2:44 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> >> Not this patch's fault, but I'm getting a bit tired seeing the above open
>> >> coded. How about adding a function that does the sleeping based on a
>> >> timestamptz and a ms interval?
>> > You mean in plugins, right? I don't recall seeing similar patterns in
>> > other code paths of backend. But I think that we can use something
>> > like that in timestamp.c then because we need to leverage that between
>> > two timestamps, the last failure and now():
>> > TimestampSleepDifference(start_time, stop_time, internal_ms);
>> > Perhaps you have something else in mind?
>> >
>> > Attached is an updated patch.
>
>> Actually I came with better than last patch by using a boolean flag as
>> return value of TimestampSleepDifference and use
>> TimestampDifferenceExceeds directly inside it.
>
>> Subject: [PATCH] Add wal_availability_check_interval to control WAL fetching
>> on failure
>
> I think that name isn't a very good. And its isn't very accurate
> either.
> How about wal_retrieve_retry_interval? Not very nice, but I think it's
> still better than the above.
Fine for me.

>> + <varlistentry id="wal-availability-check-interval" xreflabel="wal_availability_check_interval">
>> + <term><varname>wal_availability_check_interval</varname> (<type>integer</type>)
>> + <indexterm>
>> + <primary><varname>wal_availability_check_interval</> recovery parameter</primary>
>> + </indexterm>
>> + </term>
>> + <listitem>
>> + <para>
>> + This parameter specifies the amount of time to wait when
>> + WAL is not available for a node in recovery. Default value is
>> + <literal>5s</>.
>> + </para>
>> + <para>
>> + A node in recovery will wait for this amount of time if
>> + <varname>restore_command</> returns nonzero exit status code when
>> + fetching new WAL segment files from archive or when a WAL receiver
>> + is not able to fetch a WAL record when using streaming replication.
>> + </para>
>> + </listitem>
>> + </varlistentry>
>> +
>> </variablelist>
>
> Walreceiver doesn't wait that amount, but rather how long the connection
> is intact. And restore_command may or may not retry.
I changed this paragraph as follows:
+ This parameter specifies the amount of time to wait when a failure
+ occurred when reading WAL from a source (be it via streaming
+ replication, local <filename>pg_xlog</> or WAL archive) for a node
+ in standby mode, or when WAL is expected from a source. Default
+ value is <literal>5s</>.
Pondering more about this patch, I am getting the feeling that it is
not that much a win to explain in details in the doc each failure
depending on the source from which WAL is taken. But it is essential
to mention that the wait is done only for a node using standby mode.

Btw, I noticed two extra things that I think should be done for consistency:
- We should use as well wal_retrieve_retry_interval when waiting for
WAL to arrive after checking for the failures:
/*
- * Wait for more WAL to
arrive. Time out after 5 seconds,
- * like when polling the
archive, to react to a trigger
- * file promptly.
+ * Wait for more WAL to
arrive. Time out after
+ * wal_retrieve_retry_interval
ms, like when polling the archive
+ * to react to a trigger file promptly.
*/
WaitLatch(&XLogCtl->recoveryWakeupLatch,
WL_LATCH_SET
| WL_TIMEOUT,
- 5000L);
+
wal_retrieve_retry_interval * 1L);
-

Updated patch attached.
--
Michael

Attachment Content-Type Size
20150119_wal_avail_check_v3.patch application/x-patch 8.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2015-01-20 04:10:59 Re: B-Tree support function number 3 (strxfrm() optimization)
Previous Message tim_wilson 2015-01-20 03:53:03 Re: Inaccuracy in VACUUM's tuple count estimates