Re: Streaming replication connection break - unexpected EOF on standby connection

From: Fabio Pardi <f(dot)pardi(at)portavita(dot)eu>
To: Ganesh Korde <ganeshakorde(at)gmail(dot)com>
Cc: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: Streaming replication connection break - unexpected EOF on standby connection
Date: 2018-07-17 08:14:32
Message-ID: 27b2e7bf-9821-dbbe-2f6c-de414422634f@portavita.eu
Lists: pgsql-admin

Yeah, well done!

Thanks for letting us know and glad my tips helped!

regards,

fabio pardi

On 07/17/2018 10:10 AM, Ganesh Korde wrote:
> Hi,
>
> The issue has finally been resolved; it was a network problem. There
> were two separate causes, described below.
>
> 1. We have two links between primary and secondary: a VPN tunnel and
> L2TP. Multiple routes were configured for the VPN and L2TP from the
> secondary to the primary. The tunnel and the link were always up, but
> as the traffic originating from the secondary was coming through the
> tunnel towards the primary, the connection was getting dropped after
> reaching the destination, due to route ambiguity between L2TP and the
> VPN tunnel. This has been resolved by allowing access via the VPN
> tunnel and removing the L2TP route.
>
> 2. After fixing the above, disconnections still happened, but only
> after a specific time interval. This was due to a certain negotiation
> time set on the IPsec tunnel.
> So they have enabled auto-negotiation on the IPsec tunnel, so that it
> won’t wait for tunnel initiation from the other end.
>
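A disconnect that recurs after a fixed interval, as in point 2, can hint at a timer (such as a tunnel renegotiation) being involved. A minimal Python sketch of how one might check for that pattern follows. The function names and the sample timestamps are hypothetical and only illustrate the idea; they are not taken from the logs in this thread.

```python
from datetime import datetime
from statistics import mean


def disconnect_intervals(timestamps):
    """Return the gaps, in seconds, between consecutive disconnect times."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def looks_periodic(intervals, tolerance=0.05):
    """True when every gap is within `tolerance` (a fraction) of the mean gap."""
    if len(intervals) < 2:
        return False
    avg = mean(intervals)
    return all(abs(gap - avg) <= tolerance * avg for gap in intervals)


# Hypothetical disconnect times, e.g. collected from the PostgreSQL log.
SAMPLE_DISCONNECTS = [
    "2018-07-10 09:00:05",
    "2018-07-10 10:00:04",
    "2018-07-10 11:00:06",
    "2018-07-10 12:00:05",
]

if __name__ == "__main__":
    gaps = disconnect_intervals(SAMPLE_DISCONNECTS)
    print(gaps, looks_periodic(gaps))
```

The 5% tolerance is an arbitrary choice; a near-constant gap is only a hint, and the tunnel configuration still has to be checked.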
> So the disconnection issues have been resolved.
>
> Thanks Johannes and Fabio for your help.
>
> Regards,
> Ganesh.
>
> On Tue, Jul 3, 2018 at 5:37 PM Fabio Pardi
> <f(dot)pardi(at)portavita(dot)eu> wrote:
>
> Hi Ganesh,
>
> the logs you posted refer to timeouts in the connections.
>
> Your configuration means that, in case of a network drop, the standby
> server will be the first to notice it. That's because
> wal_receiver_timeout is lower than wal_sender_timeout.
>
> From the documentation:
>
> |wal_receiver_timeout| (|integer|)
>
> Terminate replication connections that are inactive longer than
> the specified number of milliseconds. This is useful for the
> receiving standby server to detect a primary node crash or
> network outage. A value of zero disables the timeout mechanism.
> This parameter can only be set in the |postgresql.conf| file or
> on the server command line. The default value is 60 seconds.
>
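As an illustration of the comparison above, here is a small Python sketch that says which side gives up first on a silent link. It is a simplified model, not a description of PostgreSQL internals: the function name and the example values (30 s and 60 s) are hypothetical, and it only compares the two settings, treating zero as "timeout disabled" as the documentation describes.

```python
def first_to_time_out(wal_receiver_timeout_ms: int, wal_sender_timeout_ms: int) -> str:
    """Return which side's timeout expires first when the link goes silent.

    wal_receiver_timeout applies on the standby, wal_sender_timeout on the
    primary. Both are given in milliseconds; zero disables the timeout.
    """
    receiver = wal_receiver_timeout_ms or None  # None means "disabled"
    sender = wal_sender_timeout_ms or None

    if receiver is None and sender is None:
        return "neither"
    if sender is None:
        return "standby"
    if receiver is None:
        return "primary"
    if receiver < sender:
        return "standby"
    if sender < receiver:
        return "primary"
    return "both"


if __name__ == "__main__":
    # Hypothetical settings: standby at 30 s, primary at 60 s.
    print(first_to_time_out(30_000, 60_000))
```

With the hypothetical values above the standby times out first, which is the situation described for this configuration.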
>
> That might explain why your secondary server sends an RST.
>
> The RST packets you posted are more a consequence than a cause of your
> problem.
>
> I think that the RST is sent to tell the master that the
> connection should be closed due to the timeout.
>
> This fits with the fact that the RST packet with sequence number
> 1232664740 is retransmitted several times, meaning that it did not
> reach its destination at first.
>
> I would look more carefully at your network, because I suspect the
> real problem might be there.
>
>
> regards,
>
> fabio pardi
>
>
>
>
> On 03/07/18 09:36, Ganesh Korde wrote:
>> Hi,
>>
>>     After analysis, the network team found that packets are being
>> reset by the secondary server. Below are the logs.
>>
>> 782.822280 port7 in <Secondary_server_IP>.35918 ->
>> <Primary_server_IP>.5433: rst 1232664740
>>
>> 782.822310 wan2 out <Secondary_server_IP>.35918 ->
>> <Primary_server_IP>.5433: rst 1232664740
>>
>> 782.822313 port7 in <Secondary_server_IP>.35918 ->
>> <Primary_server_IP>.5433: rst 1232664740
>>
>> 782.822315 wan2 out <Secondary_server_IP>.35918 ->
>> <Primary_server_IP>.5433: rst 1232664740
>>
>> 782.822317 port7 in <Secondary_server_IP>.35918 ->
>> <Primary_server_IP>.5433: rst 1232664740
>>
>> 782.822319 wan2 out <Secondary_server_IP>.35918 ->
>> <Primary_server_IP>.5433: rst 1232664740
>>
>> 782.822345 port7 in <Secondary_server_IP>.35918 ->
>> <Primary_server_IP>.5433: rst 1232664740
>>
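The capture above repeats the same RST sequence number on every line. A rough Python sketch for counting such repeats per sequence number is shown below. The line format is assumed from the seven lines quoted above (each joined back onto a single line); the regular expression and the function name are illustrative and not part of any capture tool.

```python
import re
from collections import Counter

# Assumed format: "<time> <interface> <in|out> <src>.<port> -> <dst>.<port>: rst <seq>"
RST_LINE = re.compile(
    r"^(?P<ts>\d+\.\d+)\s+(?P<iface>\S+)\s+(?P<direction>in|out)\s+"
    r"(?P<src>\S+)\s+->\s+(?P<dst>\S+):\s+rst\s+(?P<seq>\d+)$"
)


def count_rst_by_seq(lines, iface=None):
    """Count RST lines per sequence number, optionally for one interface only."""
    counts = Counter()
    for line in lines:
        match = RST_LINE.match(line.strip())
        if match is None:
            continue  # not an RST line in the assumed format
        if iface is not None and match.group("iface") != iface:
            continue
        counts[int(match.group("seq"))] += 1
    return counts


# The seven capture lines from this message, unwrapped.
SAMPLE = [
    "782.822280 port7 in <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740",
    "782.822310 wan2 out <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740",
    "782.822313 port7 in <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740",
    "782.822315 wan2 out <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740",
    "782.822317 port7 in <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740",
    "782.822319 wan2 out <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740",
    "782.822345 port7 in <Secondary_server_IP>.35918 -> <Primary_server_IP>.5433: rst 1232664740",
]

if __name__ == "__main__":
    print(count_rst_by_seq(SAMPLE))
    print(count_rst_by_seq(SAMPLE, iface="port7"))
```

Filtering on one interface avoids counting the same packet twice, once on the way in and once on the way out.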
>>
>> But they were not able to find out why the secondary is generating
>> reset packets. There are no devices between these servers which can
>> modify the packets.
>> Although the two servers are behind different firewalls, the packets
>> are being reset at the secondary server and not at the firewall
>> level; we can see this in the log.
>>
>> I would like to mention the following points about the application:
>> 1. The connection interruption happens in the daytime, when the
>> transaction rate is a little higher. In the daytime, the average is
>> 5 transactions per second (inserts and processing).
>> 2. We are not using a connection pool, so each time a request comes
>> in, the app server creates a new connection to the db server and
>> disconnects when processing is done.
>>
>> We are now clueless as to why the secondary server is resetting the
>> packets. Any help is highly appreciated.
>>
>> Thanks & Regards,
>> Ganesh.
>>
>>
>>
>>
>> On Thu, Jun 28, 2018 at 3:38 PM Ganesh Korde
>> <ganeshakorde(at)gmail(dot)com> wrote:
>>
>> Hi  Johannes,
>>
>>   Thanks for your reply. We are using a VPN tunnel between these
>> two hosts. I will check the remaining questions you mentioned with
>> the network team and get back to you.
>>
>> Thanks  & Regards,
>> Ganesh.
>>
>> On Wed, Jun 27, 2018 at 6:46 PM Johannes Truschnigg
>> <johannes(at)truschnigg(dot)info> wrote:
>>
>> Hi Ganesh,
>>
>>
>> On Wed, Jun 27, 2018 at 06:37:25PM +0530, Ganesh Korde wrote:
>> > [...]
>> > 1. Because of what reason, "unexpected EOF on standby connection"
>> > occurs on primary db server?
>> > 2. After replication disconnection, secondary should immediately
>> > connect to primary, but it takes some time, what could be the
>> > reason for this?
>>
>> From skimming the log, it seems to me that there is an issue at the
>> socket/network level, which yields the "connection reset by peer"
>> error message.
>>
>> What is the network between these two hosts like? Is it a WAN link;
>> is a VPN or SSH tunnel involved? Do you have other, long-running TCP
>> sessions between these peers, and do they experience similar or other
>> problems? Do the hosts' link-layer stats hint at problems, e.g. packet
>> loss? Do the hosts' kernels leave a message hinting at L2 connectivity
>> problems in their debug ringbuffers (`dmesg`) at the time you observe
>> the replication drop out?
>>
>> --
>> with best regards:
>> - Johannes Truschnigg ( johannes(at)truschnigg(dot)info )
>>
>> www:   https://johannes.truschnigg.info/
>> phone: +43 650 2 133337
>> xmpp:  johannes(at)truschnigg(dot)info
>>
>> Please do not bother me with HTML-email or attachments.
>> Thank you.
>>
>
