Re: [HACKERS] BUG #7534: walreceiver takes long time to detect n/w breakdown

From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Robert Haas'" <robertmhaas(at)gmail(dot)com>
Cc: "'Heikki Linnakangas'" <hlinnakangas(at)vmware(dot)com>, "'Fujii Masao'" <masao(dot)fujii(at)gmail(dot)com>, <pgsql-bugs(at)postgresql(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] BUG #7534: walreceiver takes long time to detect n/w breakdown
Date: 2012-10-09 13:04:31
Message-ID: 00ae01cda61e$9fe90290$dfbb07b0$@kapila@huawei.com
Lists: pgsql-bugs pgsql-hackers

On Tuesday, October 09, 2012 6:00 PM Robert Haas wrote:
> On Mon, Oct 8, 2012 at 10:42 AM, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
> wrote:
> > How about following:
> > 1. replication_client_timeout -- shouldn't it be client as new configuration
> > is for wal receiver
> > 2. replication_standby_timeout
>
> ISTM that the client and the standby are the same thing.

Yes, they are the same, but maybe one of the names
(replication_standby_timeout) would be easier for users to understand.


> > If we introduce a new parameter for wal receiver, wouldn't
> > replication_timeout be confusing for user?
>
> Maybe.

> I actually don't think that I understand what problem we're
> trying to solve here. If the connection between the master and the
> standby is lost, shouldn't the standby realize that it's no longer
> receiving keepalives from the master and terminate the connection?

For the wal receiver, keepalives are just another kind of message. When it
notices that it has not received any message, it tries to send a
reply/feedback message to the master after an interval of
wal_receiver_status_interval.
So the wal receiver sends a reply every wal_receiver_status_interval, but
the socket send still doesn't fail. It fails only after many send calls,
probably because send() keeps accumulating data in the socket's internal
buffer until that buffer is full, even though the other side has not
received the data.
That is why we decided to introduce a timeout parameter in the wal
receiver, similar to the one we currently have in walsender.
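
To illustrate the point above, here is a rough Python sketch (not
PostgreSQL code). It uses a local socket pair whose peer never reads, which
is only an analogy for a broken network link: small sends keep succeeding
because the data just sits in the socket buffer. The second function shows
the kind of receiver-side check a timeout parameter would enable. The names
sends_accepted_without_reader and receiver_timed_out, and the rule that a
timeout of zero disables the check, are assumptions made for illustration.

```python
import socket


def sends_accepted_without_reader(payload_size=64, attempts=5):
    """Count how many small sends succeed when the peer never calls recv().

    The peer end of the socket pair stays open but idle, so every send is
    simply queued in the socket buffer and reports success.
    """
    sender, idle_peer = socket.socketpair()
    try:
        sender.setblocking(False)
        accepted = 0
        for _ in range(attempts):
            try:
                sender.send(b"x" * payload_size)
                accepted += 1
            except BlockingIOError:
                # Only reached once the socket buffer is full.
                break
        return accepted
    finally:
        sender.close()
        idle_peer.close()


def receiver_timed_out(last_receive_time, now, timeout_secs):
    """Hypothetical receiver-side check: has the sender been silent too long?

    Assumption: a timeout of zero (or less) disables the check.
    """
    if timeout_secs <= 0:
        return False
    return now - last_receive_time >= timeout_secs
```

With a check like this, the receiver would not depend on send() eventually
failing; it would give up as soon as nothing (including keepalives) has
arrived for the configured interval.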

> I
> thought I had tested this at some point and it was working, so either
> it's subsequently gotten broken again or the scenario you're talking
> about is different in some way that I don't currently understand.

The standby takes much longer, around 15 minutes, to detect the breakdown,
whereas the master detects it much sooner, in 2-3 minutes, and the master
detects it mainly because of the timeout functionality in the wal sender.

With Regards,
Amit Kapila.
