Re: streaming replication master can fail to shut down

From: Nick Cleaton <nick(at)cleaton(dot)net>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: streaming replication master can fail to shut down
Date: 2016-03-10 13:13:37
Message-ID: CAFgz3kv_6YA2FNXr=Eho6qhKY64KkVH8oDnhakjHHPM8VkA=7w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 9 March 2016 at 16:17, Nick Cleaton <nick(at)cleaton(dot)net> wrote:
> I've found that the master server does not finish shutting down in
> response to SIGINT to the postmaster when a streaming replica is
> connected, if and only if all of the following hold:
>
> * The master server is a specific new bit of kit, which has slightly
> higher performance CPUs than the replica and is much more idle.
> * The master is configured with archive_mode=on and a non-empty archive command.
> * I do not have "strace -p ..." running on the master against the wal
> sender servicing the slow replica.

To debug the replication conversation during a failing master shutdown
with my 9.5.1 test installation, I've now also set ssl=false in the
server config and then:

* start a tcpdump
* start the master
* start the replica
* sleep 60
* kill -INT the postmaster on the master

A libpcap dump of the first 100k packets exchanged is at
http://nick.cleaton.net/protodump-100k-nossl.xz (800k compressed)

Highlights from pgshark: at master shutdown time there's:

Packet: t=1457611872.617496, session=17578864249730
PGSQL: type=XLogData, B -> F
XLOG DATA len=2896

... and everything following that is a storm of:

Packet: t=1457611872.761098, session=17578864249730
PGSQL: type=PrimaryKeepalive, B -> F
PRIMARY KEEP-ALIVE

Packet: t=1457611872.761221, session=17578864249730
PGSQL: type=StandbyStatusUpdate, F -> B
STANDBY STATUS UPDATE

Packet: t=1457611872.761363, session=17578864249730
PGSQL: type=PrimaryKeepalive, B -> F
PRIMARY KEEP-ALIVE

There were a total of 25486 StandbyStatusUpdate messages and 64769
PrimaryKeepalive messages during the 0.6 seconds between the start of
master shutdown and tcpdump hitting the 100k packet limit.

Relevant top and netstat output about 10 minutes into the blocked shutdown:

PID USER PR NI VIRT RES SHR S %CPU %MEM
TIME+ COMMAND
master: 6167 postgres 20 0 251632 3020 1372 S 74.7 0.0
0:10.64 postgres: wal sender process postgres 10.122.82.98(49730)
streaming 0/11000098
replica: 28211 postgres 20 0 263100 4580 1964 R 100.0 0.0
0:21.26 postgres: wal receiver process streaming 0/11000000

Proto Recv-Q Send-Q Local Address Foreign Address
State
master: tcp 0 2553544 10.122.83.27:5432
10.122.82.98:49730 ESTABLISHED
replica: tcp 4284291 390 10.122.82.98:49730
10.122.83.27:5432 ESTABLISHED

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Teodor Sigaev 2016-03-10 13:46:21 Re: BUG #13440: unaccent does not remove all diacritics
Previous Message Oliver Seemann 2016-03-10 12:50:28 Undetected deadlock with pg_dump on hot_standby server