Re: walwriter interacts quite badly with synchronous_commit=off

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: walwriter interacts quite badly with synchronous_commit=off
Date: 2023-10-25 18:59:41
Lists: pgsql-hackers


On 2023-10-25 12:17:03 +0300, Heikki Linnakangas wrote:
> On 25/10/2023 02:09, Andres Freund wrote:
> > Because of the inherent delay between the checks of XLogCtl->WalWriterSleeping
> > and Latch->is_set, we also sometimes end up with multiple processes signalling
> > walwriter, which can be bad, because it increases the likelihood that some of
> > the signals may be received when we are already holding WALWriteLock, delaying
> > its release...
> That can only happen when walwriter has just come out of "hibernation", ie.
> when the system has been idle for a while. So probably not a big deal in
> practice.

Maybe I am missing something here - why can this only happen when hibernating?
Even outside of that, two backends can decide that they need to wake up
walwriter at the same time.

We could prevent that, by updating state when requesting walwriter to be woken
up. But with the changes we're discussing below, that should be rare.

> > I think the most important optimization we need is to have
> > XLogSetAsyncXactLSN() only wake up if there is a certain amount of unflushed
> > WAL. Unless walwriter is hibernating, walwriter will wake up on its own after
> > wal_writer_delay. I think we can just reuse WalWriterFlushAfter for this.
> >
> > E.g. a condition like
> > if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ)
> > return;
> > drastically cuts down on the amount of wakeups, without - I think - losing
> > guarantees around synchronous_commit=off.
> In the patch, you actually did:
> > + if (WriteRqstPtr <= LogwrtResult.Flush + WalWriterFlushAfter * XLOG_BLCKSZ)
> > + return;
> It means that you never wake up the walwriter to merely *write* the WAL. You
> only wake it up if it's going to also fsync() it. I think that's correct and
> appropriate, but it took me a while to reach that conclusion:

Yea, after writing the email I got worried that just looking at Write would
perhaps lead to not flushing data soon enough...

> It might be beneficial to wake up the walwriter just to perform a write(),
> to offload that work from the backend. And walwriter will actually also
> perform an fsync() after finishing the current segment, so it would make
> sense to also wake it up when 'asyncXactLSN' crosses a segment boundary.
> However, if those extra wakeups make sense, they would also make sense when
> there are no asynchronous commits involved. Therefore those extra wakeups
> should be done elsewhere, perhaps somewhere around AdvanceXLInsertBuffer().
> The condition you have in the patch is appropriate for
> XLogSetAsyncXactLSN().

Yea. I agree we should wake up walwriter in other situations too...

> Another reason to write the WAL aggressively, even if you don't flush it,
> would be to reduce the number of lost transactions on a segfault. But we
> don't give any guarantees on that, and even with the current aggressive
> logic, we only write when a page is full so you're anyway going to lose the
> last partial page.

Walwriter does end up writing the trailing partially filled page during the
next wal_writer_delay cycle.

> It also took me a while to convince myself that this calculation matches the
> calculation that XLogBackgroundFlush() uses to determine whether it needs to
> flush or not. XLogBackgroundFlush() first divides the request and result
> with XLOG_BLCKSZ and then compares the number of blocks, whereas here you
> perform the calculation in bytes. I think the result is the same, but to
> make it more clear, let's do it the same way in both places.
> See attached. It's the same logic as in your patch, just formulated more
> clearly IMHO.

Yep, makes sense!

> > Because of the frequent wakeups, we do something else that's not smart: We
> > write out 8k blocks individually, many times a second. Often thousands of
> > 8k pwrites a second.
> Even with this patch, when I bumped up wal_writer_delay to 2 so that the wal
> writer gets woken up by the async commits rather than the timeout, the write
> pattern is a bit silly:
> $ strace -p 1099926 # walwriter
> strace: Process 1099926 attached
> epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 1991) = 1
> read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128
> pwrite64(5, "\24\321\5\0\1\0\0\0\0\300\0\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0n\276\242\305"..., 1007616, 49152) = 1007616
> fdatasync(5) = 0
> pwrite64(5, "\24\321\5\0\1\0\0\0\0\20\373\5\0\0\0003\0\0\0\0\0\0\0\320\37\20\373\5\0\0\0"..., 16384, 1056768) = 16384
> epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 2000) = 1
> read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128
> pwrite64(5, "\24\321\5\0\1\0\0\0\0`\20\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0\5~\23\261"..., 1040384, 1073152) = 1040384
> fdatasync(5) = 0
> pwrite64(5, "\24\321\4\0\1\0\0\0\0@\373\5\0\0\0\0\0\0\0\0\0\0\0;\0\0\0\264'\246\3"..., 16384, 2113536) = 16384
> epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 2000) = 1
> read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128
> pwrite64(5, "\24\321\5\0\1\0\0\0\0\200 \373\5\0\0\0003\0\0\0\0\0\0\0\320\177\373\5\0\0\0"..., 1040384, 2129920) = 1040384
> fdatasync(5) = 0
> In each cycle, the wal writer writes a full 1 MB chunk
> (wal_writer_flush_after = '1MB'), flushes it, and then performs a smaller
> write before going to sleep.

I think that's actually somewhat sane - we write out the partial page in the
subsequent cycle. That won't happen if the page isn't partially filled or
doesn't have an async commit on it.

I think we end up with somewhat bogus write patterns in other cases still, but
that's really more an issue in XLogBackgroundFlush() and thus deserves a
separate patch/thread.

> > I'm not addressing that here, but I think we also have the opposite behaviour
> > - we're not waking up walwriter often enough. E.g. if you have lots of bulk
> > dataloads, walwriter will just wake up once per wal_writer_delay, leading to
> > most of the work being done by backends. We should probably wake walwriter at
> > the end of XLogInsertRecord() if there is sufficient outstanding WAL.
> Right, that's basically the same issue that I reasoned through above. I did
> some quick testing with a few different settings of wal_buffers,
> wal_writer_flush_after and wal_writer_delay, to try to see that effect. But
> I was not able to find a case where it makes a difference.

I think in the right set of circumstances it can make quite a bit of
difference. E.g. I bulk load 3GB of data in a cluster with s_b 1GB. Then I
checkpoint and VACUUM FREEZE it. With wal_writer_delay=1ms that's
considerably faster (5.4s) than with wal_writer_delay=2s (8.3s) or even the
default 200ms (7.9s), because a fast walwriter makes it much more likely that
vacuum won't need to wait for an xlog flush before replacing a buffer in the
strategy ring.

I think improving this logic would be quite worthwhile!

Another benefit of triggering wakeups based on the amount of outstanding
writes would be that we could increase wal_writer_delay substantially (with
perhaps some adjustment for the partial-trailing-page-with-async-commit case),
reducing power usage. It's imo pretty silly that we have walwriter wake up
regularly if it only needs to write once every few seconds.


Andres Freund
