Re: checkpointer continuous flushing

From: Andres Freund <andres(at)anarazel(dot)de>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-10-20 14:13:07
Message-ID: 20151020141307.GH28038@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2015-10-19 21:14:55 +0200, Fabien COELHO wrote:
> >In my performance testing it showed that calling PerformFileFlush() only
> >at segment boundaries and in CheckpointWriteDelay() can lead to rather
> >spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is
> >problematic because it only is triggered while on schedule, and not when
> >behind.
>
> When behind, the PerformFileFlush should be called on segment
> boundaries.

That means it's flushing up to a gigabyte of data at once. Far too
much. The implementation pretty always will go behind schedule for some
time. Since sync_file_range() doesn't flush in the foreground I don't
think it's important to do the flushing in concert with sleeping.

> >My testing seems to show that just adding a limit of 32 buffers to
> >FileAsynchronousFlush() leads to markedly better results.
>
> Hmmm. 32 buffers means 256 KB, which is quite small.

Why? The aim is to not overwhelm the request queue - which is where the
coalescing is done. And usually that's rather small. If you flush much more
sync_file_range starts to do work in the foreground.

> >I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
> >sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
> >might even be possible to later approximate that on windows using
> >FlushViewOfFile().
>
> I'm not sure that mmap/msync can be used for this purpose, because there is
> no real control it seems about where the file is mmapped.

I'm not following? Why does it matter where a file is mapped?

I have had a friend (Christian Kruse, thanks!) confirm that at least on
OSX msync(MS_ASYNC) triggers writeback. A freebsd dev confirmed that
that should be the case on freebsd too.

> >Having defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE) in
> >so many places looks ugly, I want to push that to the underlying
> >functions. If we add a different flushing approach we shouldn't have to
> >touch several places that don't actually really care.
>
> I agree that it is pretty ugly, but I do not think that you can remove them
> all.

Sure, never said all. But most.

> >I've replaced the NextBufferToWrite() logic with a binaryheap.h heap -
> >seems to work well, with a bit less code actually.
>
> Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3 element
> set in most case is an improvement.

Yes, it'll not matter that much in many cases. But I rather disliked the
NextBufferToWrite() implementation, especially that it walkes the array
multiple times. And I did see setups with ~15 tablespaces.

> >I've also noticed that sleeping logic in CheckpointWriteDelay() isn't
> >particularly good. In high throughput workloads the 100ms sleep is too
> >long, leading to bursty IO behaviour. If 1k+ buffers a written out a
> >second 100ms is a rather long sleep. For another that we only sleep
> >100ms when the write rate is low makes the checkpoint finish rather
> >quickly - on a slow disk (say microsd) that can cause unneccesary
> >slowdowns for concurrent activity. ISTM we should calculate the sleep
> >time in a better way.
>
> I also noted this point, but I'm not sure how to have a better approach, so
> I let it as it is. I tried 50 ms & 200 ms on some runs, without significant
> effect on performance for the test I ran then. The point of having not too
> small a value is that it provide some significant work to the IO subsystem
> without overflowing it.

I don't think that makes much sense. All a longer sleep achieves is
creating a larger burst of writes afterwards. We should really sleep
adaptively.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Albe Laurenz 2015-10-20 14:27:48 Documentation for min_wal_size and max_wal_size
Previous Message rajan 2015-10-20 14:02:45 Re: SuperUser check in pg_stat_statements