Logical replication hangs up.

From: Aleš Zelený <zeleny(dot)ales(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Logical replication hangs up.
Date: 2018-11-03 01:22:03
Message-ID: CAODqTUYb24Mb4NCao86UJfTnJM7m4NyfxMY1EqfcmbyWaj0-kw@mail.gmail.com
Lists: pgsql-general

Hello,

we are using logical replication on 10.4 and it now hangs. After some
timeout it is restarted, replays 18 GB of data and then hangs again (while
7 GB of WAL remains to be processed).

The 18 GB backlog comes from a failed migration that added a new table to
the publication while the replication user had not been granted SELECT on
that table. This was clear from the log files, and once it was resolved by
adding the SELECT privilege I expected everything to work as usual (the
same has happened many times in the test environment, where adding the
missing SELECT grant was enough to get replication working again).
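
For reference, the fix was roughly the following (the table name is only a
placeholder; the REFRESH is the usual accompanying step after adding a
table to a publication):

-- on the publisher: let the replication role read the newly published table
GRANT SELECT ON TABLE some_new_table TO repusr;

-- on the subscriber: pick the new table up in the subscription
ALTER SUBSCRIPTION sub_eur REFRESH PUBLICATION;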

RDBMS Version:
PostgreSQL 10.4 (Ubuntu 10.4-2.pgdg16.04+1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit

Publication (master) instance error message from log:

2018-11-02 18:34:14 UTC 7974 5bdc8d27.1f26 7 192.168.23.11(58884) master_prod repusr 0 5/0 sub_eur idle [00000]:LOG: terminating walsender process due to replication timeout
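
If I understand it correctly, this message is driven by wal_sender_timeout
on the publisher (wal_receiver_timeout being the counterpart on the
subscriber); both can be checked with:

-- on the publisher
SHOW wal_sender_timeout;

-- on the subscriber
SHOW wal_receiver_timeout;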

Subscription instance log:

2018-11-02 18:34:14 UTC 8657 5bdc8d26.21d1 2 0 3/0 [XX000]:ERROR: could not receive data from WAL stream: SSL connection has been closed unexpectedly
2018-11-02 18:34:14 UTC 24699 5b923b1c.607b 1209397 0 [00000]:LOG: worker process: logical replication worker for subscription 37932 (PID 8657) exited with exit code 1

While it looks like a network issue, it may not be - we have checked the
network and the monitoring as well, and some packets were being exchanged
the whole time.

We have 3 subscriptions (thus 3 replication slots) for one publication; one
subscriber instance is in the same datacenter as the master, the remaining
subscribers are remote.

>> select * from pg_replication_slots
+-----------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
| slot_name | plugin   | slot_type | datoid | database    | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn |
+-----------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
| sub_usd   | pgoutput | logical   | 16421  | master_prod | False     | True   | 16604      |      | 5536488      | 426/AAE55A68 | 426/AAE55A68        |
| sub_cad   | pgoutput | logical   | 16421  | master_prod | False     | True   | 22875      |      | 5536488      | 426/AAE55A68 | 426/AAE55A68        |
| sub_eur   | pgoutput | logical   | 16421  | master_prod | False     | True   | 16605      |      | 5536488      | 426/AAE55A68 | 426/AAE55A68        |
+-----------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
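
The backlog figures above (18 GB replayed, ~7 GB remaining) can be
estimated roughly like this on the master:

-- approximate WAL retained for each slot, from restart_lsn to the current WAL position
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots;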

After the error occurs, the WAL senders are restarted and read WAL files
until they reach the WAL file containing restart_lsn:

>> select pg_walfile_name('426/AAE55A68')
+--------------------------+
| pg_walfile_name |
+--------------------------+
| 0000000100000426000000AA |
+--------------------------+

# the open WAL file names keep changing until they reach this one:
root(at)master-db:/pgsql/pgcluster/10/master_prod# lsof -p 1560 -p 5758 -p 5790 | grep pg_wal
postgres  1560 postgres  10r  REG  259,3  16777216  115766007 /pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
postgres  5758 postgres  10r  REG  259,3  16777216  115766007 /pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
postgres  5790 postgres  10r  REG  259,3  16777216  115766007 /pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069

At this point sent_lsn stops growing and nothing happens for a while.

select * from pg_stat_replication;

  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |  replay_lsn  | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
 16604 |    37868 | repusr  | sub_usd          | 192.168.21.11 |                 |       35010 | 2018-11-02 23:52:22.059157+00 |              | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68 | 426/AAE55A68 |           |           |            |             0 | async
 16605 |    37868 | repusr  | sub_eur          | 192.168.23.11 |                 |       36388 | 2018-11-02 23:52:24.75038+00  |              | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68 | 426/AAE55A68 |           |           |            |             0 | async
 12387 |    37868 | repusr  | sub_cad          | 192.168.22.11 |                 |       51982 | 2018-11-02 23:30:31.372404+00 |              | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68 | 426/AAE55A68 |           |           |            |             0 | async
(3 rows)
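
The gap between what was already sent and what the subscribers have
confirmed can be expressed like this (just a rough sketch):

-- how far sent_lsn is ahead of replay_lsn for each subscriber
SELECT application_name,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS not_yet_replayed
  FROM pg_stat_replication;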

What I found happening behind the scenes while, according to
pg_stat_replication, "nothing is happening":

On the master, strace of the WAL senders looks like this:

open("pg_replslot/sub_usd/xid-6040508-lsn-429-57000000.snap",
O_WRONLY|O_CREAT|O_APPEND, 0600) = 15
write(15, "\334\0\0\0\0\0\0\0H\3\234W)\4\0\0\0\0\0\0\0\0\0\0\177\6\0\0%(at)\0\0"...,
220) = 220
close(15) = 0
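
These .snap files appear to be logical decoding spilling decoded changes to
disk under pg_replslot; a rough way to see how much has piled up for one
slot (superuser only; the slot name is just an example):

-- count spill files for the sub_usd slot
SELECT count(*) AS snap_files
  FROM pg_ls_dir('pg_replslot/sub_usd') AS f(name)
 WHERE name LIKE '%.snap';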

On the subscriber instance, strace of the WAL receiver looks like this:
recvfrom(7, 0x55f56a3783c3, 5, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_create1(EPOLL_CLOEXEC) = 10
epoll_ctl(10, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1781282352, u64=94512536630832}}) = 0
epoll_ctl(10, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1781282376, u64=94512536630856}}) = 0
epoll_ctl(10, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1781282400, u64=94512536630880}}) = 0
epoll_wait(10, [], 1, 1000) = 0

tcpdump:
21:33:16.811590 IP (tos 0x0, ttl 64, id 24478, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xe2ab (correct), ack 1360, win 235, options [nop,nop,TS val 3699603 ecr 2060668460], length 0
21:33:26.821860 IP (tos 0x0, ttl 64, id 64522, offset 0, flags [DF], proto TCP (6), length 120)
    192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum 0x3ecf (incorrect -> 0xbaf0), seq 1360:1428, ack 1, win 265, options [nop,nop,TS val 2060670963 ecr 3699603], length 68
21:33:26.822156 IP (tos 0x0, ttl 64, id 24479, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xceda (correct), ack 1428, win 235, options [nop,nop,TS val 3702105 ecr 2060670963], length 0
21:33:36.832445 IP (tos 0x0, ttl 64, id 64523, offset 0, flags [DF], proto TCP (6), length 120)
    192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum 0x3ecf (incorrect -> 0x3881), seq 1428:1496, ack 1, win 265, options [nop,nop,TS val 2060673465 ecr 3702105], length 68
21:33:36.832752 IP (tos 0x0, ttl 64, id 24480, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xbb09 (correct), ack 1496, win 235, options [nop,nop,TS val 3704608 ecr 2060673465], length 0
21:33:46.843063 IP (tos 0x0, ttl 64, id 64524, offset 0, flags [DF], proto TCP (6), length 120)
    192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum 0x3ecf (incorrect -> 0xcf34), seq 1496:1564, ack 1, win 265, options [nop,nop,TS val 2060675968 ecr 3704608], length 68
21:33:46.843362 IP (tos 0x0, ttl 64, id 24481, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xa737 (correct), ack 1564, win 235, options [nop,nop,TS val 3707111 ecr 2060675968], length 0

There are some checksum issues, but the packets are eventually ACKed. I
tried to open a psql session from the host running the subscriber instance
to the master instance and it worked like a charm.

I've tried pg_waldump, but found nothing that I would consider an error...
>> select pg_walfile_name('426/AAE55A68')
+--------------------------+
| pg_walfile_name |
+--------------------------+
| 0000000100000426000000AA |
+--------------------------+
/usr/lib/postgresql/10/bin/pg_waldump -s 426/AAE55A68 0000000100000426000000AA | less

rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 426/AAE55A68, prev 426/AAE559C0, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 426/AAE55AA0, prev 426/AAE55A68, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 426/AAE55AD8, prev 426/AAE55AA0, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 426/AAE55B10, prev 426/AAE55AD8, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 426/AAE55B48, prev 426/AAE55B10, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: 426/AAE55B80, prev 426/AAE55B48, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
rmgr: XLOG    len (rec/tot): 24/ 24, tx: 0, lsn: 426/AAE55BB8, prev 426/AAE55B80, desc: SWITCH

I've also done the same for sent_lsn:
>> select pg_walfile_name('429/69E9CC60');
+--------------------------+
| pg_walfile_name |
+--------------------------+
| 000000010000042900000069 |
+--------------------------+

/usr/lib/postgresql/10/bin/pg_waldump -s 429/69E9CC60 000000010000042900000069 | less

rmgr: Heap len (rec/tot): 76/ 76, tx: 5536495, lsn: 429/69E9CC60, prev 429/69E9CC10, desc: DELETE off 81 KEYS_UPDATED , blkref #0: rel 1663/16421/38572 blk 40
rmgr: Heap len (rec/tot): 76/ 76, tx: 5536495, lsn: 429/69E9CCB0, prev 429/69E9CC60, desc: DELETE off 82 KEYS_UPDATED , blkref #0: rel 1663/16421/38572 blk 40
rmgr: Heap len (rec/tot): 76/ 76, tx: 5536495, lsn: 429/69E9CD00, prev 429/69E9CCB0, desc: DELETE off 83 KEYS_UPDATED , blkref #0: rel 1663/16421/38572 blk 40

The relation 38572 from the list above is an ordinary, non-replicated table.
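
For the record, one way to map the blkref filenode to a relation (it
assumes the table has not been rewritten, so relfilenode still matches the
on-disk filenode):

-- map the filenode 38572 from the pg_waldump blkref to a relation name
SELECT relname
  FROM pg_class
 WHERE relfilenode = 38572;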

Any help or advice on how to diagnose this and get it working again is
highly appreciated.

Kind regards Ales Zeleny
