From: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila(at)huawei(dot)com> |
Cc: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown |
Date: | 2012-11-12 14:53:58 |
Message-ID: | CAHGQGwHxvOUPrtXDBMtDGHc7+7dEsF7G4GmfN2_CTPKeQXqe_Q@mail.gmail.com |
Lists: | pgsql-bugs pgsql-hackers |
On Fri, Nov 9, 2012 at 3:03 PM, Amit Kapila <amit(dot)kapila(at)huawei(dot)com> wrote:
> On Thursday, November 08, 2012 10:42 PM Fujii Masao wrote:
>> On Thu, Nov 8, 2012 at 5:53 PM, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
>> wrote:
>> > On Thursday, November 08, 2012 2:04 PM Heikki Linnakangas wrote:
>> >> On 19.10.2012 14:42, Amit kapila wrote:
>> >> > On Thursday, October 18, 2012 8:49 PM Fujii Masao wrote:
>> >> >> Before implementing the timeout parameter, I think that it's better
>> >> >> to change both pg_basebackup background process and pg_receivexlog
>> >> >> so that they send back the reply message immediately when they
>> >> >> receive the keepalive message requesting the reply. Currently, they
>> >> >> always ignore such keepalive message, so status interval parameter
>> >> >> (-s) in them always must be set to the value less than replication
>> >> >> timeout. We can avoid this troublesome parameter setting by
>> >> >> introducing the same logic of walreceiver into both pg_basebackup
>> >> >> background process and pg_receivexlog.
>> >> >
>> >> > Please find the patch attached to address the modification mentioned
>> >> > by you (send immediate reply for keepalive).
>> >> > Both pg_basebackup and pg_receivexlog use the same function
>> >> > ReceiveXlogStream, so a single change for both will address the issue.
>> >>
>> >> Thanks, committed this one after shuffling it around the changes I
>> >> committed yesterday. I also updated the docs to not claim that -s
>> >> option is required to avoid timeout disconnects anymore.
>> >
>> > Thank you.
>> > However, I think the issue will still not be completely solved.
>> > pg_basebackup/pg_receivexlog can still take a long time to
>> > detect a network break as they don't have a timeout concept. To do that I
>> > have sent one proposal which is mentioned at the end of the mail chain:
>> > http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853BBED(at)szxeml509-mbs
>> >
>> > Do you think there is any need to introduce such mechanism in
>> > pg_basebackup/pg_receivexlog?
>>
>> Are you planning to introduce the timeout mechanism in pg_basebackup
>> main process? Or background process? It's useful to implement both.
>
> By background process, you mean ReceiveXlogStream?
> For both.
>
> I think for background process, it can be done in a way similar to what we
> have done for walreceiver.
Yes.
> But I have some doubts about how to do it for the main process:
>
> Logic similar to walreceiver cannot be used in case the network goes down
> while getting the other database files from the server.
> The reason is that, to receive the data files, PQgetCopyData() is
> called in synchronous mode, so it keeps waiting indefinitely until it
> gets some data.
> In order to solve this issue, I can think of the following options:
> 1. Making this call also asynchronous (but not sure about the impact of this).
+1
Walreceiver already calls PQgetCopyData() asynchronously. ISTM you can
solve the issue in a way similar to walreceiver's.
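
A minimal sketch of the waiting step behind that idea, assuming a plain POSIX descriptor stands in for the connection socket. The helper names wait_readable and read_with_timeout are hypothetical and are not taken from walreceiver or libpq. In a libpq client the descriptor would come from PQsocket(), and once it is readable one would call PQconsumeInput() and then PQgetCopyData() with its async argument set to 1; that part is left out because the sketch does not link against libpq.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait until fd is readable (or in an error/hangup state that the next
 * read will report), or until timeout_ms elapses.
 * Returns 1 if ready, 0 on timeout, -1 on error.
 * Note: a signal restarts the wait with the full timeout (simplification).
 */
int
wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int           rc;

    assert(fd >= 0);
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    do
    {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}

/*
 * Hypothetical helper: read up to len bytes, giving up after timeout_ms.
 * Returns the byte count from read(), -1 on error, -2 on timeout.
 */
ssize_t
read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    int ready = wait_readable(fd, timeout_ms);

    if (ready < 0)
        return -1;
    if (ready == 0)
        return -2;              /* nothing arrived in time: treat as dead link */
    return read(fd, buf, len);
}
```

The caller decides what a timeout means; for a backup client it would probably close the connection and report the failure rather than keep waiting.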
> 2. In function pqWait, instead of passing the hard-coded value -1 (i.e.
> infinite wait), we can pass some finite time. This time can be received as
> a command line argument from the respective utility and set in the PGconn
> structure.
> In order to have the timeout value in PGconn, we can either:
> a. Add a new parameter in PGconn to indicate the receive timeout.
> b. Use the existing parameter connect_timeout for the receive timeout
> also, but this may lead to confusion.
> 3. Any other better option?
>
> Apart from the above issue, there is a possibility that if the network goes
> down during connect time, then it might hang, because connect_timeout by
> default will be NULL and connectDBComplete will start waiting infinitely for
> the connection to become successful.
> So shall we have a separate command line argument for this also, or any other
> way as you suggest?
Yes, I think that we should add something like a --conninfo option to
pg_basebackup and pg_receivexlog. We can easily set not only connect_timeout
but also sslmode, application_name, ... by using such an option accepting a
conninfo string.
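
To illustrate what such a conninfo string could look like, here is a small sketch that assembles one from separate settings. The function build_conninfo and the example values are hypothetical; only the keywords host, connect_timeout, sslmode and application_name are existing libpq connection parameters. The result is the kind of string that would be handed to PQconnectdb(). The sketch assumes the values contain no spaces, quotes or backslashes, so it does no quoting.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write a keyword=value conninfo string into out.
 * Returns 0 on success, -1 if the buffer is too small.
 * Assumption: values need no quoting (no spaces, quotes or backslashes).
 */
int
build_conninfo(char *out, size_t outlen, const char *host,
               int connect_timeout_sec, const char *sslmode,
               const char *appname)
{
    int n;

    assert(out != NULL && outlen > 0);
    n = snprintf(out, outlen,
                 "host=%s connect_timeout=%d sslmode=%s application_name=%s",
                 host, connect_timeout_sec, sslmode, appname);

    /* snprintf reports the length it wanted; detect truncation. */
    return (n < 0 || (size_t) n >= outlen) ? -1 : 0;
}
```

With a single string option the utilities would not need one command line switch per connection setting, which seems to be the point of the suggestion.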
>> BTW, IIRC the walsender has no timeout mechanism during sending
>> backup data to pg_basebackup. So it's also useful to implement the
>> timeout mechanism for the walsender during backup.
>
> Yes, it's useful, but for walsender the main problem is that it uses a
> blocking send call to send the data.
> I have tried using tcp_keepalive settings, but the send call doesn't return
> in case of a network break.
> The only way I could get it to return is to
> change the corresponding file /proc/sys/net/ipv4/tcp_retries2 by using
> the command
> echo "8" > /proc/sys/net/ipv4/tcp_retries2
> As per the recommendation, its value should be at least 8 (equivalent to 100
> sec)
>
> Do you have any idea how it can be achieved?
What about using pq_putmessage_noblock()?
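
For illustration only, the general idea of a send that cannot block forever can be sketched with plain POSIX calls. This is not pq_putmessage_noblock(), which is internal to the backend and is not used here; set_nonblocking and send_with_timeout are hypothetical names, and the sketch assumes a stream socket.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Put fd into non-blocking mode. Returns 0 on success, -1 on error. */
int
set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Send all len bytes on a non-blocking socket. Whenever the socket is not
 * writable, wait at most timeout_ms for it to drain.
 * Returns 0 when everything was sent, -2 on timeout, -1 on error.
 */
int
send_with_timeout(int fd, const char *buf, size_t len, int timeout_ms)
{
    size_t sent = 0;

    assert(fd >= 0);
    while (sent < len)
    {
        ssize_t n = send(fd, buf + sent, len - sent, 0);

        if (n > 0)
        {
            sent += (size_t) n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;

        /* Send buffer is full: wait for space, but only for timeout_ms. */
        struct pollfd pfd = {.fd = fd, .events = POLLOUT, .revents = 0};
        int           rc = poll(&pfd, 1, timeout_ms);

        if (rc == 0)
            return -2;          /* peer is not draining data: give up */
        if (rc < 0 && errno != EINTR)
            return -1;
    }
    return 0;
}
```

With something along these lines the sender notices a stalled peer after the chosen timeout instead of depending on the kernel's TCP retransmission limit.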
Regards,
--
Fujii Masao