Re: checkpointer continuous flushing

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-10-21 05:49:23
Message-ID: alpine.DEB.2.10.1510210727580.11852@sto
Lists: pgsql-hackers


Hello Andres,

>>> In my performance testing it showed that calling PerformFileFlush() only
>>> at segment boundaries and in CheckpointWriteDelay() can lead to rather
>>> spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is
>>> problematic because it only is triggered while on schedule, and not when
>>> behind.
>>
>> When behind, the PerformFileFlush should be called on segment
>> boundaries.
>
> That means it's flushing up to a gigabyte of data at once. Far too
> much.

Hmmm. I do not get it. There would not be gigabytes, there would be as
much as was written since the last sleep, about 100 ms ago, which is not
likely to be gigabytes?
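
A back-of-the-envelope check of that estimate, as a minimal sketch: it assumes the checkpointer really did sleep about 100 ms ago, and the 100 MB/s write rate is an illustrative figure, not a measurement.

```python
GIB = 2**30  # one gigabyte-sized segment, in bytes


def bytes_since_last_sleep(write_rate_bytes_per_s: int, sleep_ms: int = 100) -> int:
    """Data accumulated between two sleeps at a steady write rate."""
    return write_rate_bytes_per_s * sleep_ms // 1000


def rate_needed(target_bytes: int, sleep_ms: int = 100) -> int:
    """Write rate (bytes/s) required to accumulate target_bytes in one sleep interval."""
    return target_bytes * 1000 // sleep_ms


# Assumed rate of 100 MB/s: about 10 MB accumulates per 100 ms interval.
accumulated = bytes_since_last_sleep(100_000_000)

# Accumulating a full gigabyte in 100 ms would take roughly 10 GiB/s.
required = rate_needed(GIB)
```

This only covers the on-schedule case; if no sleep happens for a long time, the accumulated amount is bounded by the segment size instead.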

> The implementation will pretty much always go behind schedule for some
> time. Since sync_file_range() doesn't flush in the foreground I don't
> think it's important to do the flushing in concert with sleeping.

For me it is important to avoid accumulating too large flushes, and that
is the point of the call before sleeping.
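
To make the idea concrete, here is a minimal sketch of bounding the flush size, not the patch's actual code. `PendingFlush`, `note_write` and `flush_now` are hypothetical names, and the `flush` callback merely stands in for a `sync_file_range()`-style writeback hint.

```python
from typing import Callable, List

BLCKSZ = 8192  # PostgreSQL default block size, in bytes


class PendingFlush:
    """Hypothetical accumulator: remembers written blocks, flushes bounded batches."""

    def __init__(self, flush: Callable[[int, int], None], max_buffers: int = 32):
        self.flush = flush              # stand-in for the real flush hint
        self.max_buffers = max_buffers  # cap on buffers per flush request
        self.pending: List[int] = []

    def note_write(self, block: int) -> None:
        """Record a written block; flush as soon as the cap is reached."""
        self.pending.append(block)
        if len(self.pending) >= self.max_buffers:
            self.flush_now()

    def flush_now(self) -> None:
        """Flush whatever is pending, e.g. right before sleeping."""
        if not self.pending:
            return
        first, last = min(self.pending), max(self.pending)
        # One request covering the span; only dirty pages inside it matter.
        self.flush(first * BLCKSZ, (last - first + 1) * BLCKSZ)
        self.pending.clear()


calls = []
pf = PendingFlush(lambda offset, nbytes: calls.append((offset, nbytes)))
for blk in range(40):
    pf.note_write(blk)
pf.flush_now()  # the call before sleeping picks up the remaining 8 blocks
```

With the cap, no single request exceeds 32 buffers, and the explicit call before sleeping keeps leftovers from piling up across intervals.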

>>> My testing seems to show that just adding a limit of 32 buffers to
>>> FileAsynchronousFlush() leads to markedly better results.
>>
>> Hmmm. 32 buffers means 256 KB, which is quite small.
>
> Why?

Because the point of sorting is to generate sequential writes so that the
HDD has a lot of aligned stuff to write without moving the head, and 32 is
rather small for that.
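
As an illustration of why the limit interacts with sorting, the sketch below merges sorted block numbers into contiguous runs and caps each run at a given number of buffers. The function name `coalesce` is hypothetical, and the 8 kB block size is the PostgreSQL default.

```python
from typing import Iterable, List, Tuple

BLCKSZ = 8192  # PostgreSQL default block size, in bytes


def coalesce(blocks: Iterable[int], max_buffers: int) -> List[Tuple[int, int]]:
    """Return (byte_offset, byte_length) runs of consecutive blocks,
    each run holding at most max_buffers blocks."""
    runs: List[Tuple[int, int]] = []
    start = None
    length = 0
    for blk in sorted(set(blocks)):
        if start is not None and blk == start + length and length < max_buffers:
            length += 1  # extends the current sequential run
        else:
            if start is not None:
                runs.append((start * BLCKSZ, length * BLCKSZ))
            start, length = blk, 1  # gap or cap reached: start a new run
    if start is not None:
        runs.append((start * BLCKSZ, length * BLCKSZ))
    return runs


# 64 consecutive blocks with a cap of 32 become two 256 KB requests.
capped = coalesce(range(64), 32)
# The same blocks with a cap of 128 stay one 512 KB sequential request.
uncapped = coalesce(range(64), 128)
```

A small cap splits a long sequential stretch into several requests; whether that hurts in practice depends on how well the kernel merges them again.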

> The aim is to not overwhelm the request queue - which is where the
> coalescing is done. And usually that's rather small.

That is an argument. How small, though? It seems to be 128 by default, so
I'd rather have 128? Also, it can be changed, so maybe it should really be
a GUC?
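
For reference, on Linux the request queue size can be read from sysfs. The sketch below assumes the usual `/sys/block/<device>/queue/nr_requests` location; the helper name `queue_depth`, the device names, and the fallback of 128 are illustrative assumptions.

```python
from pathlib import Path


def queue_depth(device: str, default: int = 128) -> int:
    """Read the block-layer nr_requests setting for a device.

    Falls back to `default` when the file is missing or unreadable
    (non-Linux system, unknown device, unexpected contents).
    """
    path = Path("/sys/block") / device / "queue" / "nr_requests"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return default


# "sda" is only an example device name; adjust for the system at hand.
depth = queue_depth("sda")
```

Note that this value counts requests, not 8 kB buffers, so it is at best a rough guide for choosing a flush limit.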

> If you flush much more sync_file_range starts to do work in the
> foreground.

Argh, too bad. I would have hoped that the OS would just deal with it in an
asynchronous way; this is not an "fsync" call, just flush advice.

>>> I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
>>> sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
>>> might even be possible to later approximate that on windows using
>>> FlushViewOfFile().
>>
>> I'm not sure that mmap/msync can be used for this purpose, because there
>> seems to be no real control over where the file is mmapped.
>
> I'm not following? Why does it matter where a file is mapped?

Because it should be in shared buffers where pg needs it? You probably
would not want to mmap all pg data files in user space for a large
database? Or if so, currently the OS keeps the data in memory if it has
enough space, but if you go for mmap, this cache management would become
pg's responsibility, if I understand mmap and your intentions correctly.

> I have had a friend (Christian Kruse, thanks!) confirm that at least on
> OSX msync(MS_ASYNC) triggers writeback. A freebsd dev confirmed that
> that should be the case on freebsd too.

Good. My concern is how mmap could be used, though, not the flushing part.

>> Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3 element
>> set in most cases is an improvement.
>
> Yes, it'll not matter that much in many cases. But I rather disliked the
> NextBufferToWrite() implementation, especially that it walks the array
> multiple times. And I did see setups with ~15 tablespaces.

ISTM that it is rather an argument for taking the tablespace into the
sorting, not necessarily for a binary heap.
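
A minimal sketch of what taking the tablespace into the sort could look like, assuming a simplified buffer tag; the `BufferTag` fields and helper names are hypothetical and do not mirror the actual structs.

```python
from itertools import groupby
from typing import Dict, Iterable, List, NamedTuple


class BufferTag(NamedTuple):
    """Simplified, hypothetical buffer identity; tablespace is the leading key."""
    tablespace: int
    relation: int
    fork: int
    block: int


def sort_for_checkpoint(tags: Iterable[BufferTag]) -> List[BufferTag]:
    """Order buffers by tablespace, then relation, fork and block.
    NamedTuple instances compare field by field, so plain sorted() suffices."""
    return sorted(tags)


def per_tablespace(tags: Iterable[BufferTag]) -> Dict[int, List[BufferTag]]:
    """Split the sorted list into one sequential stream per tablespace."""
    ordered = sort_for_checkpoint(tags)
    return {ts: list(group) for ts, group in groupby(ordered, key=lambda t: t.tablespace)}


tags = [
    BufferTag(2, 10, 0, 5),
    BufferTag(1, 10, 0, 7),
    BufferTag(1, 10, 0, 3),
    BufferTag(2, 9, 0, 1),
]
streams = per_tablespace(tags)
```

One pass over the sorted array then yields each tablespace's buffers as a contiguous slice, so nothing has to walk the array repeatedly. How to interleave writes across those slices is a separate question.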

>> I also noted this point, but I'm not sure how to have a better approach, so
>> I left it as it is. I tried 50 ms & 200 ms on some runs, without significant
>> effect on performance for the test I ran then. The point of having not too
>> small a value is that it provides some significant work to the IO subsystem
>> without overflowing it.
>
> I don't think that makes much sense. All a longer sleep achieves is
> creating a larger burst of writes afterwards. We should really sleep
> adaptively.

It sounds reasonable, but what would be the criterion?

--
Fabien.
