Re: Handing off SLRU fsyncs to the checkpointer

From: Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "alvherre(at)2ndquadrant(dot)com" <alvherre(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Handing off SLRU fsyncs to the checkpointer
Date: 2020-08-27 08:48:33
Message-ID: VI1PR0701MB69603BF4F1EC2AF6E6ECF5C1F6550@VI1PR0701MB6960.eurprd07.prod.outlook.com
Lists: pgsql-hackers

Hi Thomas / hackers,

>> The append-only bottleneck appears to be limited by syscalls/s due to small block size even with everything in FS cache (but not in shared buffers, please compare with TEST1 as there was no such bottleneck at all):
>>
>> 29.62% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
>> ---copy_user_enhanced_fast_string
>> |--17.98%--copyin
>> [..]
>> | __pwrite_nocancel
>> | FileWrite
>> | mdwrite
>> | FlushBuffer
>> | ReadBuffer_common
>> | ReadBufferWithoutRelcache
>> | XLogReadBufferExtended
>> | XLogReadBufferForRedoExtended
>> | --17.57%--btree_xlog_insert
>
> To move these writes out of recovery's way, we should probably just
> run the bgwriter process during crash recovery. I'm going to look
> into that.

Sounds awesome. Also, since this thread is starting to drift away from the SLRU fsync topic, maybe we should change the subject? However, to add some data to the separate-bgwriter idea: on 14master (already with lseek() caching, SLRU fsyncs out of the way, and better sorting), I've measured the same configuration as last time, with the same append-only WAL workload on NVMe, and compared various shared_buffers settings (the buffer descriptors sizing comes from pg_shmem_allocations, which as you stated is reported incorrectly(?); I'm including it only for reference just in case):

shared_buffers=128MB  buffers_desc=1024kB  96.778, 0.438 [a]
shared_buffers=256MB  buffers_desc=2048kB  62.755, 0.577 [a]
shared_buffers=512MB  buffers_desc=4096kB  33.167, 0.620 [a]
shared_buffers=1GB    buffers_desc=8192kB  27.303, 0.929
shared_buffers=4GB    buffers_desc=32MB    27.185, 1.166
shared_buffers=8GB    buffers_desc=64MB    27.649, 1.088
shared_buffers=16GB   buffers_desc=128MB   27.584, 1.201 [b]
shared_buffers=32GB   buffers_desc=256MB   32.314, 1.171 [b]
shared_buffers=48GB   buffers_desc=384MB   31.950, 1.217
shared_buffers=64GB   buffers_desc=512MB   31.276, 1.349
shared_buffers=72GB   buffers_desc=576MB   31.925, 1.284
shared_buffers=80GB   buffers_desc=640MB   31.809, 1.413
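
For reference, the buffers_desc numbers above were pulled straight from pg_shmem_allocations, roughly like this (a sketch only; the 'Buffer Descriptors' entry name and the exact psql invocation are from memory):

    psql -Atc "SELECT name, size, allocated_size
               FROM pg_shmem_allocations
               WHERE name = 'Buffer Descriptors';"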

The amount of WAL to be replayed was ~2.8GB. To me it looks like:
a) flushing dirty buffers from StartupXLog might be a real problem, but please read on;
b) the L2/L3 cache-size hypothesis you mention seems to have very low impact here (?); the effect is visible but not that big, and it doesn't degrade linearly as one would expect (??). The L1d/L1i/L2/L3 cache sizes on this machine are 32K/32K/256K/46080K. Maybe the DB size is too small.

I've double-checked that in condition [a] (shared_buffers=128MB) there were a lot of FlushBuffer() invocations per second (perf stat -e probe_postgres:FlushBuffer -I 1000), e.g.:
# time counts unit events
1.000485217 79,494 probe_postgres:FlushBuffer
2.000861366 75,305 probe_postgres:FlushBuffer
3.001263163 79,662 probe_postgres:FlushBuffer
4.001650795 80,317 probe_postgres:FlushBuffer
5.002033273 79,951 probe_postgres:FlushBuffer
6.002418646 79,993 probe_postgres:FlushBuffer
while at 1GB of shared_buffers it sits near zero all the time. So something like a 3x speed-up seems possible if StartupXLog didn't have to care so much about dirty buffers in its critical path (the bgwriter would instead, as you say?), at least in append-only scenarios. But... I've also checked some real systems (even older PostgreSQL versions not doing that much replication): FlushBuffer() is indeed happening there at 8-12k/s, and shared buffers are pretty huge (>100GB, 0.5TB RAM), but those are *system-wide* numbers; the work is really performed by the bgwriter, not by the startup/recovering process as in this redo-bench case, because the DB is open for reads while replicating. So it appears this isn't an optimization for the hot-standby case, but for the DB-closed startup recovery/restart/disaster/PITR scenario.
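
In case it's useful, those FlushBuffer counts came from a plain uprobe plus perf stat, something along these lines (the binary path and the pgrep-based pid lookup are assumptions, adjust to taste):

    # add a userspace probe on FlushBuffer (path to the postgres binary is an assumption)
    perf probe -x /usr/pgsql-14/bin/postgres FlushBuffer
    # count invocations per second in the startup/recovering process
    perf stat -e probe_postgres:FlushBuffer -I 1000 -p $(pgrep -f 'startup recovering')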

As for the 24GB shared_buffers scenario, where dirty buffers are not a problem at all, the top of the profile (output trimmed) again looks as expected:

13.41% postgres postgres [.] hash_search_with_hash_value
|--8.31%--BufTableLookup <- ReadBuffer_common <- ReadBufferWithoutRelcache
--5.11%--smgropen
|--2.77%--XLogReadBufferExtended
--2.34%--ReadBufferWithoutRelcache
7.88% postgres postgres [.] MarkBufferDirty
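
(That profile was taken the usual way, more or less as below; the pid lookup is again just an assumption about the process title:)

    perf record -g -p $(pgrep -f 'startup recovering') -- sleep 30
    perf report --stdio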

I've tried to get the cache miss ratio via PMCs, but apparently on EC2 they are reported as not supported or return zeros (even on bigger instances). Interestingly, though, the workload has an IPC of 1.40 (instruction bound), which is strange to me, as I would expect BufTableLookup() to be heavily memory bound (?). Maybe I'll try on some different hardware one day.
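
For completeness, the IPC number came from plain hardware counters, roughly like this (same pid-lookup assumption; on EC2 the cache events mostly came back as <not supported> or zero):

    perf stat -e cycles,instructions,cache-references,cache-misses -p $(pgrep -f 'startup recovering') -- sleep 10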

>> __pread_nocancel
>> --11.44%--FileRead
>> mdread
>> ReadBuffer_common
>> ReadBufferWithoutRelcache
>> XLogReadBufferExtended
>> XLogReadBufferForRedoExtended
>
> For these reads, the solution should be WAL prefetching,(..) But... when combined with Andres's work-in-progress AIO stuff (..)

Yes, I've heard a thing or two about those :) I hope I'll be able to deliver some measurements of those two together (AIO + WAL prefetching) sooner or later.

-Jakub Wartak.
