Re: streaming replication master can fail to shut down

From: Nick Cleaton <nick(at)cleaton(dot)net>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: streaming replication master can fail to shut down
Date: 2016-04-29 07:05:51
Message-ID: CAFgz3ku0_B8g56kJ+NWQZsqcbP-+DKgAGH9WTjmUQT2BFMG2jQ@mail.gmail.com
Lists: pgsql-bugs

On 29 April 2016 at 04:38, Andres Freund <andres(at)anarazel(dot)de> wrote:

>> > I guess you have a fair amount of WAL traffic, and the receiver was
>> > behind a good bit?
>>
>> No, IIRC this was on the test cluster that I installed for the purpose
>> of replicating the problem under 9.5; it was essentially idle.
>
> The reason I'm asking is that I can't really replicate the issue
> so far. It's pretty clear that waiting_for_ping_response = true; is
> needed, but I'm suspicious that that's not all.
>
> Was your standby on a separate machine?

Yes. I've only seen it happen when the standby was on a machine with
slower CPU cores than the primary. All my attempts to reproduce it on
a single machine by slowing down the WAL receiver have failed.
I'm fairly convinced it's some sort of race that depends on the WAL
sender plus network being faster than the WAL receiver.

> What kind of latency?

1G switches; round-trip times are well under a millisecond:

root(at)XXX:~# ping XXX
PING XXX (XXX) 56(84) bytes of data.
64 bytes from XXX: icmp_seq=1 ttl=64 time=0.162 ms
64 bytes from XXX: icmp_seq=2 ttl=64 time=0.223 ms
64 bytes from XXX: icmp_seq=3 ttl=64 time=0.122 ms
64 bytes from XXX: icmp_seq=4 ttl=64 time=0.126 ms
64 bytes from XXX: icmp_seq=5 ttl=64 time=0.149 ms
