Re: Postgres 10, slave not catching up with master

From: Boris Sagadin <boris(at)infosplet(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Postgres 10, slave not catching up with master
Date: 2018-10-23 05:28:27
Message-ID: CAEzn=HTeFUG1XApiyZL5-72jbbwiB6bi8PCrGsDxymugrhGZKg@mail.gmail.com
Lists: pgsql-general

Yes, turning wal_compression off improves things. The slave mentioned below
unfortunately lagged too far behind before the setting was applied, and was
shut down. The remaining slave lags less now, although still occasionally up
to a few minutes. I think the single-threadedness of recovery is a big
slowdown for write-heavy databases. Maybe the v11 option to increase the WAL
segment size beyond 16MB will help.
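
Something like this should do it (untested sketch; the 64MB size and the
data directory path are just examples):

# disable full-page-image compression on the primary; wal_compression is
# reloadable, so no restart is needed
psql -c "ALTER SYSTEM SET wal_compression = off;"
psql -c "SELECT pg_reload_conf();"

# PostgreSQL 11's initdb can create a cluster with larger WAL segments
# (size in MB, a power of 2; 16MB stays the default)
initdb --wal-segsize=64 -D /var/lib/postgresql/11/main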

In the meantime we'll solve this by splitting the DB into 2 or 3 clusters,
or maybe by trying out a sharding solution such as Citus.

Boris

On Sun, Oct 21, 2018 at 9:06 AM, Boris Sagadin <boris(at)infosplet(dot)com> wrote:

> Hello,
>
> I have a database running on an i3.8xlarge AWS instance (256GB RAM, 32 CPU
> cores, 4x 1.9TB NVMe drives) with about 5TB of disk space occupied, ext4,
> Ubuntu 16.04.
>
> Multi-tenant DB with about 40000 tables, insert-heavy.
>
> I started a new slave with identical HW specs, using streaming replication.
> The DB synced from the master in about 4 hours, then started applying WAL.
> However, it seems it can't catch up: the delay is still around 3 hours
> (measured with now() - pg_last_xact_replay_timestamp()), even a day later.
> It goes up and down by a few hundred seconds, but seems to float around the
> 3-hour mark.
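>
> Concretely, the check is just (run on the slave; a trivial sketch):
>
> psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;"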
>
> Disk IO is low at about 10% utilization, measured with iostat; there are no
> connected clients, and the recovery process sits at around 90% CPU usage on
> a single core.
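>
> (IO utilization read off iostat's extended output, e.g. "iostat -x 1",
> watching the %util column; exact invocation from memory.)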
>
> I tried tuning various parameters, but to no avail. The only thing I found
> suspicious is that stracing the recovery process constantly produces many
> errors such as:
>
> lseek(428, 0, SEEK_END) = 780124160
> lseek(30, 0, SEEK_END) = 212992
> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
> lseek(680, 0, SEEK_END) = 493117440
> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
> lseek(774, 0, SEEK_END) = 583368704
>
> ...[snip]...
>
> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
> lseek(774, 0, SEEK_END) = 583368704
> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
> lseek(277, 0, SEEK_END) = 502882304
> lseek(6, 516096, SEEK_SET) = 516096
> read(6, "\227\320\5\0\1\0\0\0\0\340\7\246\26\274\0\0\315\0\0\0\0\0\0\0}\0178\5&/\260\r"..., 8192) = 8192
> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
> lseek(735, 0, SEEK_END) = 272809984
> read(9, 0x7ffe4001f557, 1) = -1 EAGAIN (Resource temporarily unavailable)
> lseek(277, 0, SEEK_END) = 502882304
>
> ls -l fd/9
> lr-x------ 1 postgres postgres 64 Oct 21 06:21 fd/9 -> pipe:[46358]
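>
> (For context: the trace above came from attaching strace to the recovery
> process, along the lines of "strace -p <recovery pid>". Since fd 9 is a
> pipe and the one-byte reads on it return EAGAIN, the pipe is evidently
> non-blocking and empty, so those "errors" look like polling rather than
> actual failures.)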
>
>
> Perf top on the recovery process produces:
>
> 27.76% postgres [.] pglz_decompress
> 9.90% [kernel] [k] entry_SYSCALL_64_after_swapgs
> 7.09% postgres [.] hash_search_with_hash_value
> 4.26% libpthread-2.23.so [.] llseek
> 3.64% libpthread-2.23.so [.] __read_nocancel
> 2.80% [kernel] [k] __fget_light
> 2.67% postgres [.] 0x000000000034d3ba
> 1.85% [kernel] [k] ext4_llseek
> 1.84% postgres [.] pg_comp_crc32c_sse42
> 1.44% postgres [.] hash_any
> 1.35% postgres [.] 0x000000000036afad
> 1.29% postgres [.] MarkBufferDirty
> 1.21% postgres [.] XLogReadRecord
> [...]
>
> I tried raising the process limits to unlimited with prlimit, but nothing
> changed.
>
> I can turn off WAL compression, but I doubt it's the main culprit.
> Any ideas appreciated.
>
> Regards,
> Boris
>
>
