
Re: BUG #7534: walreceiver takes long time to detect n/w breakdown

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Amit kapila <amit(dot)kapila(at)huawei(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BUG #7534: walreceiver takes long time to detect n/w breakdown
Date: 2012-10-01 10:38:49
Message-ID: 506972B9.6060104@vmware.com
Lists: pgsql-bugs, pgsql-hackers
On 21.09.2012 14:18, Amit kapila wrote:
> On Tuesday, September 18, 2012 6:02 PM Fujii Masao wrote:
> On Mon, Sep 17, 2012 at 4:03 PM, Amit Kapila<amit(dot)kapila(at)huawei(dot)com>  wrote:
>
>>> Approach-2 :
>>> Provide a variable wal_send_status_interval, such that if it is 0, the
>>> current behavior prevails, and if it is non-zero, a KeepAlive message
>>> is sent at most that long after the previous one.
>>> The modified code of WALSendLoop will be as follows:
>
> <snip>
>>> Which way do you think is better, or do you have any other idea for handling this?
>
>> I think #2 is better because it's more intuitive to a user.
>
> Please find a patch attached for implementation of Approach-2.

Hmm, I think we need to step back a bit. I've never liked the way 
replication_timeout works, where it's the user's responsibility to set 
wal_receiver_status_interval < replication_timeout. It's not very 
user-friendly. I'd rather not copy that same design to this walreceiver 
timeout. If there are two different timeouts like that, it's even worse, 
because it's easy to confuse the two.

So let's think how this should ideally work from a user's point of view. 
I think there should be just two settings: walsender_timeout and 
walreceiver_timeout. walsender_timeout specifies how long a walsender 
will keep a connection open if it doesn't hear from the walreceiver, and 
walreceiver_timeout is the same for walreceiver. The system should 
figure out itself how often to send keepalive messages so that those 
timeouts are not reached.
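To make the proposed per-side behavior concrete, here is a minimal sketch of the timer one end (walsender or walreceiver) might keep, assuming a single timeout value per side and the "ping at half the timeout" rule described next. The class name `PeerTimer`, its fields, and the string results are hypothetical illustrations, not PostgreSQL code; `timeout` stands in for the proposed walsender_timeout or walreceiver_timeout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeerTimer:
    """Hypothetical timer for one end of a replication connection.

    `timeout` plays the role of the proposed walsender_timeout or
    walreceiver_timeout (in seconds). Names are illustrative only.
    """
    timeout: float
    last_heard: float = 0.0
    ping_sent: bool = False

    def on_message(self, now: float) -> None:
        # Any message from the peer (data, keepalive or Pong) resets the timer.
        self.last_heard = now
        self.ping_sent = False

    def action(self, now: float) -> Optional[str]:
        """Return 'disconnect', 'ping', or None for the given time."""
        idle = now - self.last_heard
        if idle >= self.timeout:
            # Peer silent for the full timeout: give up on the connection.
            return "disconnect"
        if idle >= self.timeout / 2 and not self.ping_sent:
            # Half the timeout has passed: ask the peer to reply (send once).
            self.ping_sent = True
            return "ping"
        return None
```

With `timeout=60.0` and the last message at time 0, `action(10.0)` returns `None`, `action(30.0)` returns `"ping"`, a repeat check at 31 returns `None`, and `action(60.0)` returns `"disconnect"` if no reply has arrived. The user only sets `timeout`; the keepalive cadence is derived from it.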

In walsender, after half of walsender_timeout has elapsed and we haven't 
received anything from the client, the walsender process should send a 
"ping" message to the client. Whenever the client receives a Ping, it 
replies. The walreceiver does the same; when half of walreceiver_timeout 
has elapsed, it sends a Ping message to the server. Each Ping-Pong roundtrip 
resets the timer at both ends, regardless of which side initiated it, so 
if, e.g., walsender_timeout < walreceiver_timeout, the client will never 
have to initiate a Ping message, because walsender will always reach the 
walsender_timeout/2 point first and initiate the heartbeat message.
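The claim that the side with the smaller timeout always initiates the heartbeat can be checked with a small simulation. This sketch assumes an otherwise idle connection, zero network latency, and one-second ticks, and it models "each roundtrip resets the timer at both ends" as a single shared reset time; the function name `simulate` is hypothetical.

```python
from typing import Dict


def simulate(sender_timeout: int, receiver_timeout: int, duration: int) -> Dict[str, int]:
    """Count Pings initiated by each side on an idle link (hypothetical model).

    Assumptions: zero latency, integer-second ticks, and a completed
    Ping-Pong roundtrip resets the idle timer on both ends at once.
    """
    last_reset = 0
    initiated = {"walsender": 0, "walreceiver": 0}
    for now in range(1, duration + 1):
        idle = now - last_reset
        # Compare 2 * idle against the timeout to avoid float division.
        # The walsender is checked first, so it wins a tie on equal timeouts.
        if 2 * idle >= sender_timeout:
            initiated["walsender"] += 1
            last_reset = now  # Pong received: both timers reset
        elif 2 * idle >= receiver_timeout:
            initiated["walreceiver"] += 1
            last_reset = now
    return initiated
```

For example, `simulate(10, 30, 60)` gives `{"walsender": 12, "walreceiver": 0}`: the walsender pings every 5 seconds, so the walreceiver's idle time never reaches 15 seconds. Swapping the timeouts reverses the result.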

The Ping/Pong messages don't necessarily need to be new message types, 
we can use the message types we currently have, perhaps with an 
additional flag attached to them, to request the other side to reply 
immediately.
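One way to read the "additional flag" idea is a keepalive message that carries a one-byte "reply requested" field: set, it acts as a Ping; a reply with it cleared acts as a Pong. The sketch below assumes a hypothetical layout (1-byte message type, big-endian 64-bit send time, 1-byte flag) and hypothetical type bytes `b"k"` and `b"r"`; it is not the wire format of the replication protocol discussed in this thread.

```python
import struct
from typing import Optional, Tuple

# Hypothetical layout: 1-byte type, big-endian int64 send time, 1-byte flag.
_FMT = ">cqB"


def encode_keepalive(msg_type: bytes, send_time: int, reply_requested: bool) -> bytes:
    return struct.pack(_FMT, msg_type, send_time, 1 if reply_requested else 0)


def decode_keepalive(buf: bytes) -> Tuple[bytes, int, bool]:
    msg_type, send_time, flag = struct.unpack(_FMT, buf)
    return msg_type, send_time, flag != 0


def handle(buf: bytes, now: int, reply_type: bytes) -> Optional[bytes]:
    """Reply immediately only when the peer set the flag (the Ping case)."""
    _, _, reply_requested = decode_keepalive(buf)
    if reply_requested:
        # The reply (Pong) has the flag cleared so it does not trigger another reply.
        return encode_keepalive(reply_type, now, False)
    return None
```

Because the reply always clears the flag, two peers cannot end up pinging each other in a loop, and an ordinary keepalive with the flag cleared is simply absorbed as a timer reset.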

- Heikki


Copyright © 1996-2014 The PostgreSQL Global Development Group