From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: walwriter interacts quite badly with synchronous_commit=off
Date: 2023-10-24 23:09:29
Message-ID: 20231024230929.vsc342baqs7kmbte@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

I recently mentioned to Heikki that I was seeing latch-related wakeups being
frequent enough to prevent walwriter from getting much useful work done. He
asked me to write that set of concerns up, which seems quite fair...

Here's a profile of walwriter while the following pgbench run was ongoing:

c=1;psql -Xq -c 'drop table if exists testtable_logged; CREATE TABLE testtable_logged(v int not null default 0);' && PGOPTIONS='-c synchronous_commit=off' pgbench -n -c$c -j$c -Mprepared -T150 -f <(echo 'INSERT INTO testtable_logged DEFAULT VALUES;') -P1

Looking at top, walwriter is around 15-20% busy with this
workload. Unfortunately a profile quickly shows that little of that work is
useful:

perf record --call-graph dwarf -m16M -p $(pgrep -f 'walwriter') sleep 3

-   94.42%     0.00%  postgres  postgres  [.] AuxiliaryProcessMain
     AuxiliaryProcessMain
   - WalWriterMain
      + 78.26% WaitLatch
      + 14.01% XLogBackgroundFlush
      + 0.51% pgstat_report_wal
        0.29% ResetLatch
        0.13% pgstat_flush_io
      + 0.02% asm_sysvec_apic_timer_interrupt
        0.01% HandleWalWriterInterrupts (inlined)

This is confirmed by the distribution of syscalls being made:

perf trace -m128M --summary -p $(pgrep -f 'walwriter') sleep 5
syscall            calls  errors     total       min       avg       max   stddev
                                    (msec)    (msec)    (msec)    (msec)      (%)
---------------   ------  ------  --------  --------  --------  --------   ------
epoll_wait        216610       0  3744.984     0.000     0.017     0.113    0.03%
read              216602       0   333.905     0.001     0.002     0.029    0.03%
fdatasync             27       0    94.703     1.939     3.508    11.279    8.83%
pwrite64            2998       0    15.646     0.004     0.005     0.027    0.45%
openat                 2       0     0.019     0.006     0.010     0.013   34.84%
close                  2       0     0.004     0.002     0.002     0.003   25.76%

We're doing far more latch-related work than actual work.

The walwriter wakes up many, many times without having anything to do.

And if you increase the number of clients to e.g. c=8, it gets worse in some
ways:

perf trace:
epoll_wait        291512       0  2364.067     0.001     0.008     0.693    0.10%
read              290938       0   479.837     0.001     0.002     0.020    0.05%
fdatasync            146       0   410.043     2.508     2.809     7.006    1.90%
futex              56384   43982   183.896     0.001     0.003     2.791    1.65%
pwrite64           17058       0   105.625     0.004     0.006     4.015    4.61%
clock_nanosleep        1       0     1.063     1.063     1.063     1.063    0.00%
openat                 9       0     0.072     0.006     0.008     0.014   14.35%
close                  9       0     0.018     0.002     0.002     0.003    5.55%

Note that we have 5x more lock waits (the futex calls) than writes!

I think the problem is mainly that XLogSetAsyncXactLSN() wakes up walwriter
whenever it is sleeping, regardless of whether the modified asyncXactLSN will
lead to a write. We even wake up walwriter when we haven't changed
asyncXactLSN at all, because our LSN is older than some other backend's!

So often we'll just wake up walwriter, which finds no work, immediately goes
back to sleep, only to be woken again.
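
For context, the wakeup logic in XLogSetAsyncXactLSN() currently looks roughly
like the following (paraphrased from memory of xlog.c, not an exact quote):

    void
    XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN)
    {
        XLogRecPtr  WriteRqstPtr = asyncXactLSN;
        bool        sleeping;

        SpinLockAcquire(&XLogCtl->info_lck);
        LogwrtResult = XLogCtl->LogwrtResult;
        sleeping = XLogCtl->WalWriterSleeping;
        if (XLogCtl->asyncXactLSN < asyncXactLSN)
            XLogCtl->asyncXactLSN = asyncXactLSN;
        SpinLockRelease(&XLogCtl->info_lck);

        /*
         * If walwriter has announced it is going to sleep, it is always
         * woken - even if our asyncXactLSN did not advance and there is
         * nothing new to write.
         */
        if (!sleeping)
        {
            /* back off to last completed page boundary */
            WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ;

            /* if we have already flushed that far, we're done */
            if (WriteRqstPtr <= LogwrtResult.Flush)
                return;
        }

        if (ProcGlobal->walwriterLatch)
            SetLatch(ProcGlobal->walwriterLatch);
    }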

Because of the inherent delay between the checks of XLogCtl->WalWriterSleeping
and Latch->is_set, we also sometimes end up with multiple processes signalling
walwriter, which can be bad, because it increases the likelihood that some of
the signals may be received when we are already holding WALWriteLock, delaying
its release...

Because of the frequent wakeups, we do something else that's not smart: We
write out 8k blocks individually, many times a second. Often thousands of
8k pwrites a second.

We also acquire WALWriteLock and call WaitXLogInsertionsToFinish(), even when
we could already know we're not going to flush! Not cheap, when you do it this
many times a second.

There is an absolutely basic optimization, helping a bit at higher client
counts: don't wake walwriter if the new asyncXactLSN is <= the old one. But it
doesn't help that much.
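
In code, that could look something like this inside XLogSetAsyncXactLSN() (a
sketch; prevAsyncXactLSN is a hypothetical local variable, and this is not the
attached patch verbatim):

    SpinLockAcquire(&XLogCtl->info_lck);
    LogwrtResult = XLogCtl->LogwrtResult;
    sleeping = XLogCtl->WalWriterSleeping;
    prevAsyncXactLSN = XLogCtl->asyncXactLSN;
    if (prevAsyncXactLSN < asyncXactLSN)
        XLogCtl->asyncXactLSN = asyncXactLSN;
    SpinLockRelease(&XLogCtl->info_lck);

    /*
     * If our LSN doesn't advance asyncXactLSN, whoever advanced it past us
     * already took care of waking walwriter - no need to signal again.
     */
    if (asyncXactLSN <= prevAsyncXactLSN)
        return;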

I think the most important optimization we need is to have
XLogSetAsyncXactLSN() only wake up walwriter if there is a certain amount of
unflushed WAL. Unless walwriter is hibernating, it will wake up on its own
after wal_writer_delay anyway. I think we can just reuse WalWriterFlushAfter
for this.

E.g. a condition like

    if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ)
        return;

drastically cuts down on the number of wakeups, without - I think - losing any
guarantees around synchronous_commit=off.
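
Put together, the tail end of XLogSetAsyncXactLSN() could then look roughly
like this (again just a sketch, not the attached patch verbatim):

    /* if walwriter is hibernating, it needs to be woken regardless */
    if (!sleeping)
    {
        /* only wake walwriter once enough unflushed WAL has accumulated */
        if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ)
            return;
    }

    if (ProcGlobal->walwriterLatch)
        SetLatch(ProcGlobal->walwriterLatch);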

1 client:

before:
tps = 42926.288765 (without initial connection time)

syscall            calls  errors     total       min       avg       max   stddev
                                    (msec)    (msec)    (msec)    (msec)      (%)
---------------   ------  ------  --------  --------  --------  --------   ------
epoll_wait        209077       0  3746.918     0.000     0.018     0.143    0.03%
read              209073       0   310.532     0.001     0.001     0.021    0.02%
fdatasync             25       0    82.673     2.623     3.307     3.457    1.13%
pwrite64            2892       0    14.600     0.004     0.005     0.018    0.43%

after:

tps = 46244.394058 (without initial connection time)

syscall            calls  errors     total       min       avg       max   stddev
                                    (msec)    (msec)    (msec)    (msec)      (%)
---------------   ------  ------  --------  --------  --------  --------   ------
epoll_wait            25       0  4732.625     0.000   189.305   200.281    4.17%
fdatasync             25       0    90.264     2.814     3.611     3.835    1.02%
pwrite64              48       0    15.825     0.020     0.330     0.707   12.76%
read                  21       0     0.117     0.003     0.006     0.007    3.69%

8 clients:

before:

tps = 279316.646315 (without initial connection time)

postgres (2861734), 1215159 events, 100.0%

syscall            calls  errors     total       min       avg       max   stddev
                                    (msec)    (msec)    (msec)    (msec)      (%)
---------------   ------  ------  --------  --------  --------  --------   ------
epoll_wait        267517       0  2150.206     0.000     0.008     0.973    0.12%
read              266683       0   512.348     0.001     0.002     0.036    0.08%
fdatasync            149       0   413.658     2.583     2.776     3.395    0.29%
futex              56597   49588   183.174     0.001     0.003     1.047    0.69%
pwrite64           17516       0   126.208     0.004     0.007     2.927    3.93%

after:

tps = 290958.322594 (without initial connection time)

postgres (2861534), 1626 events, 100.0%

syscall            calls  errors     total       min       avg       max   stddev
                                    (msec)    (msec)    (msec)    (msec)      (%)
---------------   ------  ------  --------  --------  --------  --------   ------
epoll_wait           153       0  4383.285     0.000    28.649    32.699    0.92%
fdatasync            153       0   464.088     2.452     3.033    19.999    4.88%
pwrite64             306       0    80.361     0.049     0.263     0.590    4.38%
read                 153       0     0.459     0.002     0.003     0.004    1.37%
futex                 49      46     0.211     0.002     0.004     0.038   17.05%

More throughput for less CPU, seems neat :)

I'm not addressing that here, but I think we also have the opposite problem -
we're not waking up walwriter often enough. E.g. if you have lots of bulk data
loads, walwriter will just wake up once per wal_writer_delay, leading to most
of the work being done by backends. We should probably wake walwriter at the
end of XLogInsertRecord() if there is sufficient outstanding WAL.
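
Not something the attached patches do, but to illustrate, such a wakeup at the
end of XLogInsertRecord() could look vaguely like this (hypothetical sketch; a
real version would have to be careful about how LogwrtResult is read and about
not signalling too often):

    /* EndPos is the end LSN of the record we just inserted */
    if (EndPos >= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ &&
        ProcGlobal->walwriterLatch != NULL)
        SetLatch(ProcGlobal->walwriterLatch);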

Greetings,

Andres Freund

Attachment Content-Type Size
v1-0001-xlog.c-Rename-misleadingly-named-variable.patch text/x-diff 1.3 KB
v1-0002-Reduce-rate-of-wakeups-of-walsender-due-to-async-.patch text/x-diff 2.1 KB
