Skip site navigation (1) Skip section navigation (2)

Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

From: Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>,Hari Babu <haribabu(dot)kommi(at)huawei(dot)com>,'Fujii Masao' <masao(dot)fujii(at)gmail(dot)com>,'Heikki Linnakangas' <hlinnakangas(at)vmware(dot)com>,'Amit kapila' <amit(dot)kapila(at)huawei(dot)com>
Subject: Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown
Date: 2013-01-16 07:48:17
Message-ID: 20130116074817.GA5011@toroid.org (view raw or flat)
Thread:
Lists: pgsql-bugspgsql-hackers
Hi.

This patch was marked "Needs review" with no reviewers in the ongoing
CF, so I decided to take a look at it. I see that Zoltan has posted a
review, so I've added him to the list.

But I took a look at the latest patch in any case. Here are some
comments, mostly cosmetic ones.

> diff -dcrpN postgresql.orig/doc/src/sgml/ref/pg_basebackup.sgml postgresql/doc/src/sgml/ref/pg_basebackup.sgml
> *** postgresql.orig/doc/src/sgml/ref/pg_basebackup.sgml	2013-01-05 17:34:30.742135371 +0100
> --- postgresql/doc/src/sgml/ref/pg_basebackup.sgml	2013-01-07 15:11:40.787007890 +0100
> *************** PostgreSQL documentation
> *** 400,405 ****
> --- 400,425 ----
>        </varlistentry>
>   
>        <varlistentry>
> +       <term><option>-r <replaceable class="parameter">interval</replaceable></option></term>
> +       <term><option>--recvtimeout=<replaceable class="parameter">interval</replaceable></option></term>
> +       <listitem>
> +        <para>
> +         time that receiver waits for communication from server (in seconds).
> +        </para>
> +       </listitem>
> +      </varlistentry>     

I would reword this as "The maximum time (in seconds) to wait for data
from the server (default: wait forever)".

> +      <varlistentry>
> +       <term><option>-t <replaceable class="parameter">interval</replaceable></option></term>
> +       <term><option>--conntimeout=<replaceable class="parameter">interval</replaceable></option></term>
> +       <listitem>
> +        <para>
> +         time that client wait for connection to establish with server (in seconds).
> +        </para>
> +       </listitem>
> +      </varlistentry>          

Likewise, "The maximum time (in seconds) to wait for a connection to the
server to succeed (default: wait forever)".

Same thing in pg_receivexlog.sgml. Also, there's trailing whitespace in
various places in these files (and elsewhere in the patch), which should
be fixed.

> diff -dcrpN postgresql.orig/src/bin/pg_basebackup/pg_basebackup.c postgresql/src/bin/pg_basebackup/pg_basebackup.c
> *** postgresql.orig/src/bin/pg_basebackup/pg_basebackup.c	2013-01-05 17:34:30.778135625 +0100
> --- postgresql/src/bin/pg_basebackup/pg_basebackup.c	2013-01-07 15:16:24.610037886 +0100
> *************** bool		streamwal = false;
> *** 45,50 ****
> --- 45,54 ----
>   bool		fastcheckpoint = false;
>   bool		writerecoveryconf = false;
>   int			standby_message_timeout = 10 * 1000;		/* 10 sec = default */
> + int		standby_recv_timeout = 60*1000;		/* 60 sec = default */
> + char		*standby_connect_timeout = NULL;		

I don't really like standby_recv_timeout being an int and
standby_connect_timeout being a char *. I understand that it's so that
it can be assigned to "values[i]" in GetConnection(), but that reason is
very distant, and not obvious from this code at all.

That said, I don't know if it's really worth bothering with.

> + #define NAPTIME_PER_CYCLE 100	/* max sleep time between cycles (100ms) */

This probably needs a better comment. Why are we sleeping between
cycles? What cycles?

> + 	printf(_("  -r, --recvtimeout=INTERVAL time that receiver waits for communication from\n"
> + 		   "                             server (in seconds)\n"));
> + 	printf(_("  -t, --conntimeout=INTERVAL time that client wait for connection to establish\n"
> + 		   "                             with server (in seconds)\n"));	

Same comments about wording apply, but perhaps there's no need to
mention the default.

> ! 			if (r == 0 || (r < 0 && errno == EINTR))
> ! 			{
> ! 				/*
> ! 				 * Got a timeout or signal. Before Continuing the loop, check for timeout.
> ! 				 */
> ! 				if (standby_recv_timeout > 0)
> ! 				{
> ! 					now = localGetCurrentTimestamp();

I'd make "now" local to this block, and get rid of the comment. The two
"if"s are perfectly clear. This applies to the same pattern in other
places in the patch as well.

> ! 					if (localTimestampDifferenceExceeds(last_recv_timestamp, now, standby_recv_timeout))
> ! 					{
> ! 						fprintf(stderr, _("%s: terminating DB File receive due to timeout\n"),

Better wording? "DB File receive" is confusing. Even something like
"Closing connection due to read timeout" would be better. Or perhaps
you can make it like the following message, slightly lower:

> ! 			if (PQconsumeInput(conn) == 0)
> ! 			{
> ! 				fprintf(stderr,
> ! 						_("%s: could not receive data from WAL Sender: %s"),
> ! 						progname, PQerrorMessage(conn));

…and in the former case, say "read timeout" instead of PQerrorMessage().

> ! 			/* Set the last reply timestamp */
> ! 			last_recv_timestamp = localGetCurrentTimestamp();
> ! 
> ! 			/* Some data is received, so go back read them in buffer*/			
> ! 			continue;

No need for these comments.

> + 	/* Set the last reply timestamp */
> + 	last_recv_timestamp = localGetCurrentTimestamp();

Likewise (in various places).

>   	/*
> ! 	 * Connect in replication mode to the server, Sending connect_timeout
> ! 	 * as configured, there is no need for rw_timeout.
>   	 */
> ! 	conn = GetConnection(standby_connect_timeout);

This comment is pretty confusing.

>    * Connect to the server. Returns a valid PGconn pointer if connected,
>    * or NULL on non-permanent error. On permanent error, the function will
>    * call exit(1) directly.
> +  * Set conn_timeout to PGconn structure if their value 
> +  * is not NULL.
>    */
>   PGconn *
> ! GetConnection(char *conn_timeout)

And this comment is just wrong.

The patch looks OK otherwise. Zoltan indicated that his tests were
successful, so I didn't retest. Marking "Waiting on author" again.

-- Abhijit


In response to

pgsql-hackers by date

Next:From: Abhijit Menon-SenDate: 2013-01-16 08:21:18
Subject: CF3+4 (was Re: Parallel query execution)
Previous:From: Tom LaneDate: 2013-01-16 07:07:29
Subject: Re: Parallel query execution

pgsql-bugs by date

Next:From: Heikki LinnakangasDate: 2013-01-16 10:31:54
Subject: Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown
Previous:From: Tomonari KatsumataDate: 2013-01-16 05:48:41
Subject: Re: BUG #7803: Replication Problem(no master is there)

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group