Re: Logical replication hangs up.

From: Jerry Sievers <gsievers19(at)comcast(dot)net>
To: Aleš Zelený <zeleny(dot)ales(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Logical replication hangs up.
Date: 2018-11-03 16:02:08
Message-ID: 87wopugnen.fsf@jsievers.enova.com
Lists: pgsql-general

Aleš Zelený <zeleny(dot)ales(at)gmail(dot)com> writes:

> Hello,
>
> we are using logical replication on 10.4 and it now hangs. After
> some timeout it is restarted, replays 18 GB of data, and then hangs
> again (while 7 GB of WAL remains to be processed).

Timeout...

Have a look at the two settings, wal_sender_timeout and
wal_receiver_timeout; you probably need to raise the sender timeout
value.
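
For example, something along these lines (the 5min value is only a
sketch; size it to how long your apply side can realistically stall):

-- On the publisher: check, then raise, then reload the config.
SHOW wal_sender_timeout;
ALTER SYSTEM SET wal_sender_timeout = '5min';
SELECT pg_reload_conf();

-- On each subscriber: keep the receiver timeout in the same ballpark.
SHOW wal_receiver_timeout;
ALTER SYSTEM SET wal_receiver_timeout = '5min';
SELECT pg_reload_conf();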

HTH

>
> The backlog of 18 GB comes from a failed migration that added a new
> table to the publication while the replication user had not been
> granted SELECT on the table. This was clear from the log files, and
> once it was resolved by adding the SELECT privilege I expected
> everything to work as usual (the same thing happened in the test
> environment many times, and adding the missing SELECT grant was
> enough to get it working... those were test-only issues).
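
For anyone else hitting this: the publisher-side fix is just a SELECT
grant for the replication role. A minimal sketch, with a placeholder
table name and the repusr role from the logs below:

-- placeholder table name; lets the table sync worker's COPY read the table
GRANT SELECT ON public.newly_added_table TO repusr;
-- or, more broadly, everything in the schema:
GRANT SELECT ON ALL TABLES IN SCHEMA public TO repusr;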
>
> RDBMS Version:
> PostgreSQL 10.4 (Ubuntu 10.4-2.pgdg16.04+1) on x86_64-pc-linux-gnu,
> compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609,
> 64-bit
>
> Publication (master) instance error message from log:
>
> 2018-11-02 18:34:14 UTC 7974 5bdc8d27.1f26 7 192.168.23.11(58884) master_prod repusr 0 5/0 sub_eur idle [00000]:LOG:  terminating walsender process due to replication timeout
>
> Subscription instance log:
>
> 2018-11-02 18:34:14 UTC 8657 5bdc8d26.21d1 2    0 3/0   [XX000]:ERROR:  could not receive data from WAL stream: SSL connection has been closed unexpectedly
> 2018-11-02 18:34:14 UTC 24699 5b923b1c.607b 1209397    0    [00000]:LOG:  worker process: logical replication worker for subscription 37932 (PID 8657) exited with exit code 1
>
> While it looks like a network issue, it may not be one - we have
> checked the network and the monitoring, and packets were being
> exchanged the whole time.
>
> We have 3 subscriptions (and thus 3 replication slots) for one
> publication; one subscriber instance is in the same datacenter as the
> master, the remaining subscribers are remote.
>
>
>>> select * from pg_replication_slots
> +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
> | slot_name  | plugin   | slot_type | datoid | database    | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn |
> +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
> | sub_usd    | pgoutput | logical   | 16421  | master_prod | False     | True   | 16604      |      | 5536488      | 426/AAE55A68 | 426/AAE55A68        |
> | sub_cad    | pgoutput | logical   | 16421  | master_prod | False     | True   | 22875      |      | 5536488      | 426/AAE55A68 | 426/AAE55A68        |
> | sub_eur    | pgoutput | logical   | 16421  | master_prod | False     | True   | 16605      |      | 5536488      | 426/AAE55A68 | 426/AAE55A68        |
> +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
>
>
> Each time after the error occurs, the wal senders are restarted and
> they read WAL files until they reach the WAL file containing
> restart_lsn:
>
>>> select pg_walfile_name('426/AAE55A68')
> +--------------------------+
> | pg_walfile_name          |
> +--------------------------+
> | 0000000100000426000000AA |
> +--------------------------+
>
> # the file names keep changing until they reach this one:
> root@master-db:/pgsql/pgcluster/10/master_prod# lsof -p 1560 -p 5758 -p 5790 | grep pg_wal
> postgres 1560 postgres   10r      REG  259,3  16777216 115766007 /pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
> postgres 5758 postgres   10r      REG  259,3  16777216 115766007 /pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
> postgres 5790 postgres   10r      REG  259,3  16777216 115766007 /pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
>
>
> At this moment sent_lsn stops growing and nothing is happening for a
> while.
>
> select * from pg_stat_replication;
>
>   pid  | usesysid | usename |  application_name  |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |  replay_lsn  | write_lag | flush_lag | replay_lag | sync_priority | sync_state
> -------+----------+---------+--------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
>  16604 |    37868 | repusr  | sub_usd            | 192.168.21.11 |                 |       35010 | 2018-11-02 23:52:22.059157+00 |              | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68 | 426/AAE55A68 |           |           |            |             0 | async
>  16605 |    37868 | repusr  | sub_eur            | 192.168.23.11 |                 |       36388 | 2018-11-02 23:52:24.75038+00  |              | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68 | 426/AAE55A68 |           |           |            |             0 | async
>  12387 |    37868 | repusr  | sub_cad            | 192.168.22.11 |                 |       51982 | 2018-11-02 23:30:31.372404+00 |              | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68 | 426/AAE55A68 |           |           |            |             0 | async
> (3 rows)
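
To put a number on that stall, something like this on the publisher
shows the byte gap between what each walsender has sent and what each
subscriber has flushed (pg_wal_lsn_diff and pg_current_wal_lsn are
available in 10):

SELECT application_name,
       state,
       pg_wal_lsn_diff(sent_lsn, flush_lsn)            AS sent_but_unflushed_bytes,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS not_yet_sent_bytes
FROM pg_stat_replication;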
>
> Here is what I found going on behind the scenes while, according to
> pg_stat_replication, "nothing is happening":
>
> On the master, strace of the wal senders looks like this:
>
> open("pg_replslot/sub_usd/xid-6040508-lsn-429-57000000.snap",
> O_WRONLY|O_CREAT|O_APPEND, 0600) = 15
> write(15, "\334\0\0\0\0\0\0\0H\3\234W)\4\0\0\0\0\0\0\0\0\0\0\177\6\0\
> 0%(at)\0\0"..., 220) = 220
> close(15)                               = 0
>
> On the subscription instance, strace for the wal receiver looks like this:
> recvfrom(7, 0x55f56a3783c3, 5, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> epoll_create1(EPOLL_CLOEXEC)            = 10
> epoll_ctl(10, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1781282352, u64=94512536630832}}) = 0
> epoll_ctl(10, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1781282376, u64=94512536630856}}) = 0
> epoll_ctl(10, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1781282400, u64=94512536630880}}) = 0
> epoll_wait(10, [], 1, 1000)             = 0
>
> tcpdump:
> 21:33:16.811590 IP (tos 0x0, ttl 64, id 24478, offset 0, flags [DF], proto TCP (6), length 52)
>     192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xe2ab (correct), ack 1360, win 235, options [nop,nop,TS val 3699603 ecr 2060668460], length 0
> 21:33:26.821860 IP (tos 0x0, ttl 64, id 64522, offset 0, flags [DF], proto TCP (6), length 120)
>     192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum 0x3ecf (incorrect -> 0xbaf0), seq 1360:1428, ack 1, win 265, options [nop,nop,TS val 2060670963 ecr 3699603], length 68
> 21:33:26.822156 IP (tos 0x0, ttl 64, id 24479, offset 0, flags [DF], proto TCP (6), length 52)
>     192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xceda (correct), ack 1428, win 235, options [nop,nop,TS val 3702105 ecr 2060670963], length 0
> 21:33:36.832445 IP (tos 0x0, ttl 64, id 64523, offset 0, flags [DF], proto TCP (6), length 120)
>     192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum 0x3ecf (incorrect -> 0x3881), seq 1428:1496, ack 1, win 265, options [nop,nop,TS val 2060673465 ecr 3702105], length 68
> 21:33:36.832752 IP (tos 0x0, ttl 64, id 24480, offset 0, flags [DF], proto TCP (6), length 52)
>     192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xbb09 (correct), ack 1496, win 235, options [nop,nop,TS val 3704608 ecr 2060673465], length 0
> 21:33:46.843063 IP (tos 0x0, ttl 64, id 64524, offset 0, flags [DF], proto TCP (6), length 120)
>     192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum 0x3ecf (incorrect -> 0xcf34), seq 1496:1564, ack 1, win 265, options [nop,nop,TS val 2060675968 ecr 3704608], length 68
> 21:33:46.843362 IP (tos 0x0, ttl 64, id 24481, offset 0, flags [DF], proto TCP (6), length 52)
>     192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xa737 (correct), ack 1564, win 235, options [nop,nop,TS val 3707111 ecr 2060675968], length 0
>
> There are some checksum issues, but the segments are finally ACKed. I
> tried to open a psql session from the host running the subscription
> instance to the master instance and it worked like a charm.
>
>
> I've tried pg_waldump, but found nothing that I would consider an
> error...
>>> select pg_walfile_name('426/AAE55A68')
> +--------------------------+
> | pg_walfile_name          |
> +--------------------------+
> | 0000000100000426000000AA |
> +--------------------------+
> /usr/lib/postgresql/10/bin/pg_waldump -s 426/AAE55A68 0000000100000426000000AA | less
>
> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55A68, prev 426/AAE559C0, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55AA0, prev 426/AAE55A68, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55AD8, prev 426/AAE55AA0, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55B10, prev 426/AAE55AD8, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55B48, prev 426/AAE55B10, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55B80, prev 426/AAE55B48, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
> rmgr: XLOG        len (rec/tot):     24/    24, tx:          0, lsn: 426/AAE55BB8, prev 426/AAE55B80, desc: SWITCH
>
>
> I've also done the same for sent_lsn:
>>> select pg_walfile_name('429/69E9CC60');
> +--------------------------+
> | pg_walfile_name          |
> +--------------------------+
> | 000000010000042900000069 |
> +--------------------------+
>
> /usr/lib/postgresql/10/bin/pg_waldump -s 429/69E9CC60 000000010000042900000069 | less
> rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn: 429/69E9CC60, prev 429/69E9CC10, desc: DELETE off 81 KEYS_UPDATED , blkref #0: rel 1663/16421/38572 blk 40
> rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn: 429/69E9CCB0, prev 429/69E9CC60, desc: DELETE off 82 KEYS_UPDATED , blkref #0: rel 1663/16421/38572 blk 40
> rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn: 429/69E9CD00, prev 429/69E9CCB0, desc: DELETE off 83 KEYS_UPDATED , blkref #0: rel 1663/16421/38572 blk 40
>
> The relation 38572 from the list above is an ordinary, non-replicated
> table.
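
As a side note, a quick way to double-check which table a relfilenode
from pg_waldump maps to (run in the master_prod database; 38572 is the
value from the blkref lines above):

SELECT oid::regclass AS table_name
FROM pg_class
WHERE relfilenode = 38572;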
>
> Any help or advice on how to diagnose this and get it working again
> is highly appreciated.
>
> Kind regards Ales Zeleny
>
>

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres(dot)consulting(at)comcast(dot)net
p: 312.241.7800
