Re: Handing off SLRU fsyncs to the checkpointer

From: Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Handing off SLRU fsyncs to the checkpointer
Date: 2020-08-25 09:16:03
Message-ID: VI1PR0701MB696023DA7815207237196DC8F6570@VI1PR0701MB6960.eurprd07.prod.outlook.com
Lists: pgsql-hackers

On Wed, Aug 12, 2020 at 6:06 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> [patch]

Hi Thomas / hackers,

I just wanted to help test this patch (defer SLRU fsyncs during recovery) and also the faster compactify_tuples() patch [2], as both are related to WAL recovery performance, which I'm interested in. This is my first message to this mailing list, so please let me know if I should adjust the testing style or formatting.

With both of those patches applied:
make check -> Passes
make check-world -> Passes
make standbycheck (section "Testing Hot Standby" from docs) -> Passes
There wasn't a single segfault or postmaster crash during the tests.
As for reviewing the patches themselves: I'm not qualified to review the PostgreSQL internals.
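
For reference, roughly how the patched tree was built and checked (a sketch; only the SLRU patch filename is the one posted in this thread, and the build options are just the defaults listed in the Specs section at the end):

git apply v2-0001-Defer-flushing-of-SLRU-files.patch
# plus the compactify_tuples() sort patch from [2], applied the same way
./configure
make -j16 && make check && make check-world
# "make standbycheck" additionally needs a manually configured primary/standby
# pair, per the "Testing Hot Standby" section of the docs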

I used Thomas's redo-bench scripts [1] to measure the performance effect (this approach simplifies testing and excludes network jitter): the 1st column is the redo start -> end timing, the 2nd is the redo end -> end-of-checkpoint timing before the DB is opened for reads (the idea is sketched after the list below). I conducted several separate tests that show the benefits of those patches depending on the workload:
- Handing SLRU sync work over to the checkpointer: in my testing it improves WAL recovery performance on slower / higher-latency storage by ~20%
- Faster sort in compactify_tuples(): in my testing it improves WAL recovery performance for HOT updates, also by ~20%
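
For anyone who has not seen the scripts: a minimal sketch of the measurement idea, assuming a cluster that has just absorbed a write-heavy workload (the actual redo-bench scripts in [1] automate the snapshot/restore and timing; paths and details here are illustrative):

pg_ctl stop -D "$PGDATA" -m immediate        # "crash" the cluster, leaving WAL to be replayed
cp -a "$PGDATA" "$PGDATA.crashed"            # keep a snapshot of the crashed state
# each timed run restores the snapshot and replays the WAL:
rm -rf "$PGDATA" && cp -a "$PGDATA.crashed" "$PGDATA"
pg_ctl start -D "$PGDATA" -w -l logfile
# 1st column: time between the "redo starts at" and "redo done at" log lines;
# 2nd column: time from "redo done" until the end-of-recovery checkpoint completes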

TEST1: workload profile test as per standard TPC-B pgbench -c8 -j8, with the majority of records in the WAL stream being Heap/HOT_UPDATE. Roughly how the workload was generated is sketched below; the timings follow.
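
(In the sketch below, the scale factor and the per-client transaction count are placeholders, not necessarily the exact values I used.)

pgbench -i -s 100 bench                 # load a standard TPC-B dataset
pgbench -c 8 -j 8 -t 1000000 bench      # produces mostly Heap/HOT_UPDATE WAL records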

xvda, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
72.991, 0.919
73.688, 1.027
72.228, 1.032

xvda, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch
72.271, 0.857
72.717, 0.748
72.705, 0.81

xvda, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
72.49, 0.103
74.069, 0.102
73.368, 0.102

NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch
70.312, 0.22
70.615, 0.201
69.739, 0.194

NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
69.686, 0.101
70.601, 0.102
70.042, 0.101

As Thomas stated, with the standard pgbench workload profile the compactify_tuples()->pg_qsort() overhead is visible on the recovery side. So this is where the 2nd patch helps:

NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch and the compactify_tuples()->pg_qsort() performance improvement
58.85, 0.296
58.605, 0.269
58.603, 0.277

NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch and the compactify_tuples()->pg_qsort() performance improvement, fsync=off
58.618, 0.103
57.693, 0.101
58.779, 0.111
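
As an aside, this kind of call-graph profile of the single startup process can be captured roughly like this (a sketch; the pgrep pattern and the sampling window are illustrative and assume the default process titles):

# sample the startup process for a minute while recovery is running
perf record -F 99 -g -p "$(pgrep -f 'postgres: startup')" -- sleep 60
perf report --stdio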

In this last case the top of the profile is still related to sorting, but, as I understand it, in a much more optimal way:

26.68% postgres postgres [.] qsort_itemoff
---qsort_itemoff
|--14.17%--qsort_itemoff
| |--10.96%--compactify_tuples
| | PageRepairFragmentation
| | heap2_redo
| | StartupXLOG
| --3.21%--qsort_itemoff
| --3.10%--compactify_tuples
| PageRepairFragmentation
| heap2_redo
| StartupXLOG
--12.51%--compactify_tuples
PageRepairFragmentation
heap2_redo
StartupXLOG

8.38% postgres libc-2.17.so [.] __memmove_ssse3_back
---__memmove_ssse3_back
compactify_tuples
PageRepairFragmentation
heap2_redo
StartupXLOG

6.51% postgres postgres [.] hash_search_with_hash_value
---hash_search_with_hash_value
|--3.62%--smgropen
| |--2.17%--XLogReadBufferExtended
| | --1.76%--XLogReadBufferForRedoExtended
| | --0.93%--heap_xlog_update
| --1.45%--ReadBufferWithoutRelcache
| XLogReadBufferExtended
| --1.34%--XLogReadBufferForRedoExtended
| --0.72%--heap_xlog_update
--2.69%--BufTableLookup
ReadBuffer_common
ReadBufferWithoutRelcache
XLogReadBufferExtended
--2.48%--XLogReadBufferForRedoExtended
|--1.34%--heap2_redo
| StartupXLOG
--0.83%--heap_xlog_update

So to sum up, a HOT update-like workload profile tends to be CPU-bound on the single-process recovery side. Even the slow storage (like xvda) was not the bottleneck here, as the data I used was already hot in the VFS cache.
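
(That the recovery is CPU-bound is easy to cross-check by watching the startup process, e.g. with pidstat from sysstat; a sketch, where the pgrep pattern assumes the default process title:)

# per-second user/system CPU usage of the startup process during replay
pidstat -u -p "$(pgrep -f 'postgres: startup')" 1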

TEST2: The second suite of tests used an append-only workload profile (the same number of transactions, using pgbench -c8 -j8 -f insert_only.sql -n -t 1000000), but with a simpler structure:
CREATE TABLE t (c1 uuid NOT NULL, c2 uuid NOT NULL, c3 integer NOT NULL, c4 integer NOT NULL);
CREATE INDEX i1 ON t USING btree (c1);
CREATE INDEX i2 ON t USING btree (c2);
CREATE INDEX i3 ON t USING btree (c3);
CREATE EXTENSION "uuid-ossp";

and a customized script for pgbench containing just this: BEGIN; INSERT INTO t (c1,c2,c3,c4) VALUES (uuid_generate_v4(), uuid_generate_v4(), round(5000*random()), 500); END;
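
Spelled out as a runnable sketch (the database name "bench" is a placeholder; the script body and pgbench options are the ones quoted above):

cat > insert_only.sql <<'EOF'
BEGIN;
INSERT INTO t (c1,c2,c3,c4) VALUES (uuid_generate_v4(), uuid_generate_v4(), round(5000*random()), 500);
END;
EOF
pgbench -c 8 -j 8 -f insert_only.sql -n -t 1000000 bench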

The majority of WAL records were Btree/INSERT_LEAF (60%). One way to check such a distribution is sketched below; the results follow.
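
(The segment names in the sketch are placeholders; pg_waldump's --stats=record prints per-record-type counts and sizes for a range of segments.)

pg_waldump --stats=record \
    "$PGDATA"/pg_wal/000000010000000000000010 \
    "$PGDATA"/pg_wal/000000010000000000000020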

xvda, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
120.732, 5.275
120.608, 5.902
120.685, 5.872

xvda, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch
99.061, 7.071
99.472, 6.386
98.036, 5.994

xvda, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
102.838, 0.136
102.711, 0.099
103.093, 0.0970001

NVMe, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
96.46, 0.405
96.121, 0.405
95.951, 0.402

NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch
94.124, 0.387
96.61, 0.416
94.624, 0.451

NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
95.401, 0.0969999
95.028, 0.099
94.632, 0.0970001

So apparently v2-0001-Defer-flushing-of-SLRU-files.patch helps in my case on higher-latency storage.

The append-only workload appears to be limited by syscalls/s due to the small block size, even with everything in the FS cache (but not in shared buffers; compare with TEST1, where there was no such bottleneck at all):

29.62% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
---copy_user_enhanced_fast_string
|--17.98%--copyin
[..]
| __pwrite_nocancel
| FileWrite
| mdwrite
| FlushBuffer
| ReadBuffer_common
| ReadBufferWithoutRelcache
| XLogReadBufferExtended
| XLogReadBufferForRedoExtended
| --17.57%--btree_xlog_insert
| btree_redo
| StartupXLOG
|
--11.64%--copyout
[..]
__pread_nocancel
--11.44%--FileRead
mdread
ReadBuffer_common
ReadBufferWithoutRelcache
XLogReadBufferExtended
XLogReadBufferForRedoExtended
--11.34%--btree_xlog_insert
btree_redo
StartupXLOG

5.62% postgres postgres [.] hash_search_with_hash_value
---hash_search_with_hash_value
|--1.74%--smgropen
[..]

The number of syscalls/s topped out at ~100k/s, at 8kB each, for reads and writes respectively (it is logical I/O, as everything fits in the VFS cache; there was nearly no physical I/O). It was also visible during the test that the startup/recovering process spent ~60% of its time in %sys under these conditions. As no sorting was visible in the profiler, I did not test this workload with the compactify_tuples()->pg_qsort() performance improvement, although basic runs suggest it does not introduce any degradation.
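
For what it's worth, the syscall rate of the startup process can be cross-checked with something like the following (a sketch; it assumes the reads and writes are issued as pread64/pwrite64, as the __pread_nocancel/__pwrite_nocancel frames above suggest):

# count read/write syscalls issued by the startup process over 10 seconds
perf stat -e 'syscalls:sys_enter_pread64,syscalls:sys_enter_pwrite64' \
    -p "$(pgrep -f 'postgres: startup')" -- sleep 10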

TEST2b: Quite frankly, at first glance I did not understand why btree_xlog_insert()->ReadBuffer_common() would require FlushBuffer() and therefore writes, until I bumped shared_buffers from 128MB to 24GB: apparently it was evicting dirty buffers, which caused those pwrite()s - it is not evident as a direct call from ReadBuffer*() in the code (inlining? any idea how to monitor the cycling of dirty buffers on a standby during recovery, when there is no bgwriter yet? one idea is sketched after the timings below). I also wanted to eliminate storage and the VFS cache as bottlenecks. This led to results like the below (fsync=on, defer patch applied, shared_buffers=24GB to eliminate VFS cache lookups):
27.341, 0.858
27.26, 0.869
26.959, 0.86
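
As for monitoring the dirty-buffer churn mentioned above: on a hot standby, where connections are allowed while WAL is being applied, one could poll pg_buffercache, roughly like this (a sketch, assuming the extension is installed):

# sample the number of dirty shared buffers once per second
while sleep 1; do
    psql -X -t -c "SELECT now(), count(*) AS dirty_buffers FROM pg_buffercache WHERE isdirty;"
done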

Turning the defer-SLRU patch and/or fsync on or off doesn't seem to make any difference here, so, in case anyone is curious, the next set of append-only bottlenecks looks like this:

14.69% postgres postgres [.] hash_search_with_hash_value
---hash_search_with_hash_value
|--9.80%--BufTableLookup
| ReadBuffer_common
| ReadBufferWithoutRelcache
| XLogReadBufferExtended
| XLogReadBufferForRedoExtended
| |--7.76%--btree_xlog_insert
| | btree_redo
| | StartupXLOG
| --1.63%--heap_xlog_insert
--4.90%--smgropen
|--2.86%--ReadBufferWithoutRelcache
| XLogReadBufferExtended
| |--2.04%--XLogReadBufferForRedoExtended
| | |--1.22%--btree_xlog_insert
| | | btree_redo
| | | StartupXLOG
| | --0.82%--heap_xlog_insert
| --0.82%--XLogRecordPageWithFreeSpace
| heap_xlog_insert
--2.04%--XLogReadBufferExtended
XLogReadBufferForRedoExtended
--1.63%--btree_xlog_insert
btree_redo
StartupXLOG

7.76% postgres libc-2.17.so [.] __memmove_ssse3_back
---__memmove_ssse3_back
PageAddItemExtended
btree_xlog_insert
btree_redo
StartupXLOG

Still, the result seems nice, as it is 10^6 (trx) / 27s = ~34k TPS on the recovery side (~3000MB of WAL / ~27s = ~111MB/s, without FPWs), provided there are no I/O overheads and my assumptions hold. For the full picture, compare this with the INSERT rates I was able to get on the primary without any special tuning:
- ~29k TPS of INSERTs-only/COMMITs with pgbench -c8
- ~49k TPS of INSERTs-only/COMMITs with pgbench -c16
so the single-process WAL recovery code in 14devel/master seems to reach roughly 70% of the performance of a fairly low-end 16vCPU primary under the above append-only conditions.

Specs: Linux 4.14 kernel, ext4 filesystems (data=ordered,noatime), 1s8c16t Xeon CPU E5-2686 v4 @ 2.30GHz, 128GB RAM, gcc 7.2.1, CFLAGS set by ./configure to "-O2", tests run with the default/tiny shared_buffers except for the last one.

xvda AKA slow storage: root file system, single thread tests:
dd if=/dev/zero of=test bs=1M count=1000 oflag=direct ==> 157 MB/s
dd if=/dev/zero of=test bs=8k count=10000 oflag=direct => 17.2 MB/s

NVMe: striped VG consisting of 2x NVMe devices with much lower latency
dd if=/dev/zero of=test bs=1M count=1000 oflag=direct ==> 1.9 GB/s or maybe even more
dd if=/dev/zero of=test bs=8k count=10000 oflag=direct => 141 MB/s
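
Since the deferred-SLRU patch is mostly about fsync cost, the two devices could also be characterized with pg_test_fsync; I did not collect those numbers, but a sketch would be (the path is a placeholder):

# per-fsync latency on the target filesystem, ~5 seconds per method
pg_test_fsync -s 5 -f /mnt/xvda/pg_test_fsync.tmp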

-Jakub Wartak.

[1] - https://github.com/macdice/redo-bench/
[2] - https://commitfest.postgresql.org/29/2687/
