Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

From: Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Boszormenyi Zoltan <zb(at)cybertec(dot)at>, Hari Babu <haribabu(dot)kommi(at)huawei(dot)com>, 'Fujii Masao' <masao(dot)fujii(at)gmail(dot)com>, 'Heikki Linnakangas' <hlinnakangas(at)vmware(dot)com>, 'Amit kapila' <amit(dot)kapila(at)huawei(dot)com>
Subject: Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown
Date: 2013-01-16 07:48:17
Message-ID: 20130116074817.GA5011@toroid.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

Hi.

This patch was marked "Needs review" with no reviewers in the ongoing
CF, so I decided to take a look at it. I see that Zoltan has posted a
review, so I've added him to the list.

But I took a look at the latest patch in any case. Here are some
comments, mostly cosmetic ones.

> diff -dcrpN postgresql.orig/doc/src/sgml/ref/pg_basebackup.sgml postgresql/doc/src/sgml/ref/pg_basebackup.sgml
> *** postgresql.orig/doc/src/sgml/ref/pg_basebackup.sgml 2013-01-05 17:34:30.742135371 +0100
> --- postgresql/doc/src/sgml/ref/pg_basebackup.sgml 2013-01-07 15:11:40.787007890 +0100
> *************** PostgreSQL documentation
> *** 400,405 ****
> --- 400,425 ----
> </varlistentry>
>
> <varlistentry>
> + <term><option>-r <replaceable class="parameter">interval</replaceable></option></term>
> + <term><option>--recvtimeout=<replaceable class="parameter">interval</replaceable></option></term>
> + <listitem>
> + <para>
> + time that receiver waits for communication from server (in seconds).
> + </para>
> + </listitem>
> + </varlistentry>

I would reword this as "The maximum time (in seconds) to wait for data
from the server (default: wait forever)".

> + <varlistentry>
> + <term><option>-t <replaceable class="parameter">interval</replaceable></option></term>
> + <term><option>--conntimeout=<replaceable class="parameter">interval</replaceable></option></term>
> + <listitem>
> + <para>
> + time that client wait for connection to establish with server (in seconds).
> + </para>
> + </listitem>
> + </varlistentry>

Likewise, "The maximum time (in seconds) to wait for a connection to the
server to succeed (default: wait forever)".

Same thing in pg_receivexlog.sgml. Also, there's trailing whitespace in
various places in these files (and elsewhere in the patch), which should
be fixed.

> diff -dcrpN postgresql.orig/src/bin/pg_basebackup/pg_basebackup.c postgresql/src/bin/pg_basebackup/pg_basebackup.c
> *** postgresql.orig/src/bin/pg_basebackup/pg_basebackup.c 2013-01-05 17:34:30.778135625 +0100
> --- postgresql/src/bin/pg_basebackup/pg_basebackup.c 2013-01-07 15:16:24.610037886 +0100
> *************** bool streamwal = false;
> *** 45,50 ****
> --- 45,54 ----
> bool fastcheckpoint = false;
> bool writerecoveryconf = false;
> int standby_message_timeout = 10 * 1000; /* 10 sec = default */
> + int standby_recv_timeout = 60*1000; /* 60 sec = default */
> + char *standby_connect_timeout = NULL;

I don't really like standby_recv_timeout being an int and
standby_connect_timeout being a char *. I understand that it's so that
it can be assigned to "values[i]" in GetConnection(), but that reason is
very distant, and not obvious from this code at all.

That said, I don't know if it's really worth bothering with.

> + #define NAPTIME_PER_CYCLE 100 /* max sleep time between cycles (100ms) */

This probably needs a better comment. Why are we sleeping between
cycles? What cycles?

> + printf(_(" -r, --recvtimeout=INTERVAL time that receiver waits for communication from\n"
> + " server (in seconds)\n"));
> + printf(_(" -t, --conntimeout=INTERVAL time that client wait for connection to establish\n"
> + " with server (in seconds)\n"));

Same comments about wording apply, but perhaps there's no need to
mention the default.

> ! if (r == 0 || (r < 0 && errno == EINTR))
> ! {
> ! /*
> ! * Got a timeout or signal. Before Continuing the loop, check for timeout.
> ! */
> ! if (standby_recv_timeout > 0)
> ! {
> ! now = localGetCurrentTimestamp();

I'd make "now" local to this block, and get rid of the comment. The two
"if"s are perfectly clear. This applies to the same pattern in other
places in the patch as well.

> ! if (localTimestampDifferenceExceeds(last_recv_timestamp, now, standby_recv_timeout))
> ! {
> ! fprintf(stderr, _("%s: terminating DB File receive due to timeout\n"),

Better wording? "DB File receive" is confusing. Even something like
"Closing connection due to read timeout" would be better. Or perhaps
you can make it like the following message, slightly lower:

> ! if (PQconsumeInput(conn) == 0)
> ! {
> ! fprintf(stderr,
> ! _("%s: could not receive data from WAL Sender: %s"),
> ! progname, PQerrorMessage(conn));

…and in the former case, say "read timeout" instead of PQerrorMessage().

> ! /* Set the last reply timestamp */
> ! last_recv_timestamp = localGetCurrentTimestamp();
> !
> ! /* Some data is received, so go back read them in buffer*/
> ! continue;

No need for these comments.

> + /* Set the last reply timestamp */
> + last_recv_timestamp = localGetCurrentTimestamp();

Likewise (in various places).

> /*
> ! * Connect in replication mode to the server, Sending connect_timeout
> ! * as configured, there is no need for rw_timeout.
> */
> ! conn = GetConnection(standby_connect_timeout);

This comment is pretty confusing.

> * Connect to the server. Returns a valid PGconn pointer if connected,
> * or NULL on non-permanent error. On permanent error, the function will
> * call exit(1) directly.
> + * Set conn_timeout to PGconn structure if their value
> + * is not NULL.
> */
> PGconn *
> ! GetConnection(char *conn_timeout)

And this comment is just wrong.

The patch looks OK otherwise. Zoltan indicated that his tests were
successful, so I didn't retest. Marking "Waiting on author" again.

-- Abhijit

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Heikki Linnakangas 2013-01-16 10:31:54 Re: Review of "pg_basebackup and pg_receivexlog to use non-blocking socket communication", was: Re: Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown
Previous Message Tomonari Katsumata 2013-01-16 05:48:41 Re: BUG #7803: Replication Problem(no master is there)

Browse pgsql-hackers by date

  From Date Subject
Next Message Abhijit Menon-Sen 2013-01-16 08:21:18 CF3+4 (was Re: Parallel query execution)
Previous Message Tom Lane 2013-01-16 07:07:29 Re: Parallel query execution