Postgres 10, slave not catching up with master

From:	Boris Sagadin <boris(at)infosplet(dot)com>
To:	pgsql-general(at)lists(dot)postgresql(dot)org
Subject:	Postgres 10, slave not catching up with master
Date:	2018-10-21 07:06:31
Message-ID:	CAEzn=HQZMKymHF=wBNekZ=eXm3WtqMqkR3bH9XdLwv4RXDHJbg@mail.gmail.com
Lists:	pgsql-general

Hello,

I have a database running on i3.8xlarge (256GB RAM, 32 CPU cores, 4x 1.9TB
NVMe drive) AWS instance with about 5TB of disk space occupied, ext4,
Ubuntu 16.04.

Multi-tenant DB with about 40000 tables, insert heavy.

I started a new slave with identical HW specs, using streaming replication
(SR). The DB started syncing from the master, which took about 4 hours, then
it started applying the WALs. However, it seems it can't catch up. The delay
is still around 3 hours (measured with
now() - pg_last_xact_replay_timestamp()), even a day later. It goes up and
down by a few hundred seconds, but it seems to float around the 3h mark.
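
To make "can't catch up" measurable, the lag figure can be turned into a rough replay-speed ratio. This is only a minimal sketch: it assumes the lag is sampled periodically with the query above and written down by hand, and the function name replay_speed_ratio is hypothetical. It also assumes the master is committing continuously, since pg_last_xact_replay_timestamp() only advances when a commit or abort record is replayed.

```python
def replay_speed_ratio(samples):
    """Estimate replay speed relative to the rate at which the master commits.

    samples: list of (wall_clock_seconds, lag_seconds) pairs, oldest first.
    Returns 1.0 when the lag is flat, more than 1.0 when the standby is
    catching up, and less than 1.0 when it is falling further behind.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, lag0), (t1, lag1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must span a positive time interval")
    # The replayed timestamp moved from (t0 - lag0) to (t1 - lag1), so in
    # `elapsed` seconds the standby replayed this many seconds of master time:
    replayed = elapsed - (lag1 - lag0)
    return replayed / elapsed
```

A lag that floats around the same value gives a ratio close to 1.0: the standby roughly keeps pace with new WAL, but has no spare capacity to work off the existing backlog.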

Disk IO is low at about 10% (measured with iostat), there are no connected
clients, and the recovery process uses around 90% of a single CPU core.

I tried tuning various parameters, but to no avail. The only suspicious
thing I found is that stracing the recovery process constantly produces many
errors such as:

lseek(428, 0, SEEK_END) = 780124160
lseek(30, 0, SEEK_END) = 212992
read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
lseek(680, 0, SEEK_END) = 493117440
read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
lseek(774, 0, SEEK_END) = 583368704

...[snip]...

read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
lseek(774, 0, SEEK_END) = 583368704
read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
lseek(277, 0, SEEK_END) = 502882304
lseek(6, 516096, SEEK_SET) = 516096
read(6, "\227\320\5\0\1\0\0\0\0\340\7\246\26\274\0\0\315\0\0\0\0\0\0\0}\0178\5&/\260\r"..., 8192) = 8192
read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
lseek(735, 0, SEEK_END) = 272809984
read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
lseek(277, 0, SEEK_END) = 502882304

ls -l fd/9
lr-x------ 1 postgres postgres 64 Oct 21 06:21 fd/9 -> pipe:[46358]
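
To see how the calls in a longer strace capture are distributed, the output can be tallied per syscall, result and file descriptor. The following is a rough sketch, assuming the capture has been saved to a file and read into a list of lines; the names CALL_RE and tally_strace are made up for illustration. Note that EAGAIN on a non-blocking read only means that no data was available at that moment, so the tally reports it under its own label instead of lumping it in with other failures.

```python
import re
from collections import Counter

# Matches lines such as: lseek(428, 0, SEEK_END) = 780124160
# and: read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily ...
CALL_RE = re.compile(r"^(\w+)\((\d+),.*\)\s*=\s*(-?\d+)(?:\s+(\w+))?")

def tally_strace(lines):
    """Count strace lines per (syscall, outcome) and per (syscall, fd)."""
    outcomes = Counter()
    fds = Counter()
    for line in lines:
        m = CALL_RE.match(line.strip())
        if not m:
            continue  # wrapped continuation lines, "[snip]" markers, etc.
        name, fd, ret, errno_name = m.groups()
        # A return value of -1 followed by a symbol is an errno name.
        outcome = errno_name if ret == "-1" and errno_name else "ok"
        outcomes[(name, outcome)] += 1
        fds[(name, int(fd))] += 1
    return outcomes, fds
```

Calling outcomes.most_common() on the result shows at a glance whether the capture is dominated by lseek(..., SEEK_END) calls, by the reads on the pipe, or by something else.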

perf top on the recovery process produces:

27.76% postgres [.] pglz_decompress
9.90% [kernel] [k] entry_SYSCALL_64_after_swapgs
7.09% postgres [.] hash_search_with_hash_value
4.26% libpthread-2.23.so [.] llseek
3.64% libpthread-2.23.so [.] __read_nocancel
2.80% [kernel] [k] __fget_light
2.67% postgres [.] 0x000000000034d3ba
1.85% [kernel] [k] ext4_llseek
1.84% postgres [.] pg_comp_crc32c_sse42
1.44% postgres [.] hash_any
1.35% postgres [.] 0x000000000036afad
1.29% postgres [.] MarkBufferDirty
1.21% postgres [.] XLogReadRecord
[...]
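
To compare decompression against syscall overhead, the percentages can be summed per shared object. This is a small sketch, assuming the perf top lines have been copied into a list of strings in the "percent object [type] symbol" shape shown above; the function name share_by_object is hypothetical.

```python
from collections import defaultdict

def share_by_object(perf_lines):
    """Sum the percentage column of perf top lines per shared object."""
    totals = defaultdict(float)
    for line in perf_lines:
        parts = line.split()
        # Expected shape: "27.76% postgres [.] pglz_decompress"
        if len(parts) < 4 or not parts[0].endswith("%"):
            continue  # skip "[...]" markers and anything else
        try:
            pct = float(parts[0].rstrip("%"))
        except ValueError:
            continue
        totals[parts[1]] += pct
    return dict(totals)
```

Applied to the lines listed above, this gives about 44.65% for postgres (27.76% of it in pglz_decompress), 14.55% for [kernel] and 7.90% for libpthread-2.23.so. The listing is truncated, so these sums are lower bounds.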

I tried raising the process limits to unlimited with prlimit, but nothing
changed.

I can turn off WAL compression, but I doubt this is the main culprit.
Any ideas appreciated.

Regards,
Boris

Responses

Re: Postgres 10, slave not catching up with master at 2018-10-21 13:52:50 from Andy Colson
Re: Postgres 10, slave not catching up with master at 2018-10-23 05:28:27 from Boris Sagadin
