Re: checkpointer continuous flushing

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2016-01-20 10:13:26
Message-ID: 20160120101326.rvao4mcuntxxf7wf@alap3.anarazel.de
Lists: pgsql-hackers

On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > This seems like a problem with the WAL writer quite independent of
> > anything else. It seems likely to be inadvertent fallout from this
> > patch:
> >
> > Author: Simon Riggs <simon(at)2ndQuadrant(dot)com>
> > Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +0000
> >
> > Wakeup WALWriter as needed for asynchronous commit performance.
> > Previously we waited for wal_writer_delay before flushing WAL. Now
> > we also wake WALWriter as soon as a WAL buffer page has filled.
> > Significant effect observed on performance of asynchronous commits
> > by Robert Haas, attributed to the ability to set hint bits on tuples
> > earlier and so reducing contention caused by clog lookups.
>
> In addition to that the "powersaving" effort also plays a role - without
> the latch we'd not wake up at any meaningful rate at all atm.

The relevant thread is at
http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
What I didn't remember is that I voiced concern back then about exactly this:
http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
;)

Simon: CCed you, as the author of the above commit. Quick summary:
The frequent wakeups of wal writer can lead to significant performance
regressions in workloads that are bigger than shared_buffers, because
the super-frequent fdatasync()s by the wal writer slow down concurrent
writes (bgwriter, checkpointer, individual backend writes)
dramatically, to the point that SIGSTOPing the wal writer takes a
pgbench workload from 2995 to 10887 tps. The reason the fdatasyncs
cause a slowdown is that they prevent real use of queuing to the
storage devices.
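
To illustrate the queuing point, here's a minimal standalone C program
(nothing postgres-specific; file name and sizes are arbitrary) that
writes the same 32MB but varies how many pages go between
fdatasync()s. The per-page variant is dramatically slower on most
storage, because the device only ever sees one outstanding request at
a time:

/*
 * Standalone illustration, not postgres code.  Writes 32MB of 8kB
 * pages, issuing fdatasync() every 'pages_per_sync' pages.  Compare
 * "time ./synctest 1" with "time ./synctest 128": syncing every page
 * serializes the I/O, so the device never gets a deep queue.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SZ 8192
#define NPAGES  4096

int
main(int argc, char **argv)
{
	int		pages_per_sync = (argc > 1) ? atoi(argv[1]) : 1;
	char	buf[PAGE_SZ];
	int		fd, i;

	if (pages_per_sync < 1)
		pages_per_sync = 1;

	memset(buf, 'x', sizeof(buf));
	fd = open("synctest.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	for (i = 0; i < NPAGES; i++)
	{
		if (write(fd, buf, PAGE_SZ) != PAGE_SZ)
		{
			perror("write");
			return 1;
		}
		if ((i + 1) % pages_per_sync == 0 && fdatasync(fd) < 0)
		{
			perror("fdatasync");
			return 1;
		}
	}
	close(fd);
	return 0;
}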

On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > If I understand correctly, prior to that commit, WAL writer woke up 5
> > times per second and flushed just that often (unless you changed the
> > default settings). But as the commit message explained, that turned
> > out to suck - you could make performance go up very significantly by
> > radically decreasing wal_writer_delay. This commit basically lets it
> > flush at maximum velocity - as fast as we finish one flush, we can
> > start the next. That must have seemed like a win at the time from the
> > way the commit message was written, but you seem to now be seeing the
> > opposite effect, where performance is suffering because flushes are
> > too frequent rather than too infrequent. I wonder if there's an ideal
> > flush rate and what it is, and how much it depends on what hardware
> > you have got.
>
> I think the problem isn't really that it's flushing too much WAL in
> total, it's that it's flushing WAL in a too granular fashion. I suspect
> we want something where we attempt a minimum number of flushes per
> second (presumably tied to wal_writer_delay) and, once exceeded, a
> minimum number of pages per flush. I think we even could continue to
> write() the data at the same rate as today, we just would need to reduce
> the number of fdatasync()s we issue. And possibly could make the
> eventual fdatasync()s cheaper by hinting the kernel to write them out
> earlier.
>
> Now the question what the minimum number of pages we want to flush for
> (setting wal_writer_delay triggered ones aside) isn't easy to answer. A
> simple model would be to statically tie it to the size of wal_buffers;
> say, don't flush unless at least 10% of XLogBuffers have been written
> since the last flush. More complex approaches would be to measure the
> continuous WAL writeout rate.
>
> By tying it to both a minimum rate under activity (ensuring things go to
> disk fast) and a minimum number of pages to sync (ensuring a reasonable
> number of cache flush operations) we should be able to mostly accommodate
> the different types of workloads. I think.
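
In code form, the above heuristic might look roughly like this
(GetWrittenLSN() and IssueFdatasync() are placeholders, not functions
that exist in the tree):

/*
 * Rough sketch only: GetWrittenLSN() and IssueFdatasync() are
 * placeholders.  We keep write()ing at today's rate, but only
 * fdatasync() once enough pages have accumulated, or once
 * wal_writer_delay has elapsed since the last flush (guaranteeing
 * the minimum flush rate).
 */
static XLogRecPtr	last_flush_lsn = InvalidXLogRecPtr;
static TimestampTz	last_flush_time = 0;

static void
MaybeFlushWal(void)
{
	XLogRecPtr	write_lsn = GetWrittenLSN();	/* placeholder */
	TimestampTz	now = GetCurrentTimestamp();
	int			pages_pending = (write_lsn - last_flush_lsn) / XLOG_BLCKSZ;
	int			flush_threshold = Max(1, XLOGbuffers / 10);

	if (pages_pending <= 0)
		return;

	/* flush on >= 10% of wal_buffers written, or on the delay timeout */
	if (pages_pending >= flush_threshold ||
		TimestampDifferenceExceeds(last_flush_time, now, WalWriterDelay))
	{
		IssueFdatasync();	/* placeholder */
		last_flush_lsn = write_lsn;
		last_flush_time = now;
	}
}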

That proposal unfortunately leaves out part of the reasoning for the
above commit: we want WAL to be flushed quickly, so that we can set
hint bits immediately.

One, relatively extreme, approach would be to continue *writing* WAL
in the background as today, but use rules like those suggested above
to guide the actual flushing, additionally using operations like
sync_file_range() (and equivalents on other OSs) to push data towards
disk early. Then, to address the regression of SetHintBits() having
to bail out more often, SetHintBits() could actually trigger a WAL
flush whenever the WAL it needs is already written, but not yet
flushed. That has the potential to be bad in a number of other cases
though :(
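
For the sync_file_range() part, a minimal linux-only sketch
(HintWalWriteback() is a made-up name):

/*
 * Sketch; HintWalWriteback() is a made-up name.  SYNC_FILE_RANGE_WRITE
 * initiates writeback of the given range without waiting for it to
 * complete, so the eventual fdatasync() finds most of the data already
 * on its way to disk.  Other OSs would need their own equivalent.
 */
#ifdef HAVE_SYNC_FILE_RANGE
static void
HintWalWriteback(int fd, off_t offset, off_t nbytes)
{
	if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) < 0)
		elog(WARNING, "sync_file_range failed: %m");
}
#endif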

Andres
