Re: checkpointer continuous flushing

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-08-17 19:32:24
Message-ID: alpine.DEB.2.10.1508171911360.5011@sto
Lists: pgsql-hackers


Hello Andres,

>>> [...] posix_fadvise().
>>
>> My current thinking is "maybe yes, maybe no":-), as it may depend on the OS
>> implementation of posix_fadvise, so it may differ between OS.
>
> As long as fadvise has no 'undirty' option, I don't see how that
> problem goes away. You're telling the OS to throw the buffer away, so
> unless it ignores it that'll have consequences when you read the page
> back in.

Yep, probably.

Note that we are talking about checkpoints, which "write" buffers out
*but* keep them nevertheless. As the buffer is kept in shared buffers, the
OS page is a duplicate, and freeing it should do no harm, at least not
immediately.

The situation is different if the memory is reused in between, which is
the work of the bgwriter I think, based on LRU/LFU heuristics, but such
writes are not flushed by the current patch.

Now, if a buffer was recently updated it should not be selected by the
bgwriter, provided the LRU/LFU heuristics work as expected, which mitigates
the issue somewhat...

To sum up, I agree that it is indeed possible that flushing with
posix_fadvise could reduce OS read-cache hits on some systems for some
workloads, although not on Linux, see below.

So, without further data, the option is best kept "off" for now; I'm fine
with that.

> [...] I'd say it should then be an os-specific default. No point in
> making people work for it needlessly on linux and/or elsewhere.

Ok. The attached version 9 does that: "on" for Linux, "off" for others
because of the potential issues you mentioned.
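
For illustration, the compile-time selection of the default is just a
sketch along these lines (the macro name below is made up for the example,
not necessarily what the patch uses):

    /* Hypothetical sketch: enable flushing by default only on Linux,
     * where sync_file_range() gives fine-grained writeback control;
     * keep it off elsewhere pending more data. */
    #ifdef __linux__
    #define DEFAULT_CHECKPOINT_FLUSH_TO_DISK true
    #else
    #define DEFAULT_CHECKPOINT_FLUSH_TO_DISK false
    #endif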

>> (Another reason to keep it "off" is that I'm not sure about what
>> happens with such HD flushing features on virtual servers).
>
> I don't see how that matters? Either the host will entirely ignore
> flushing, and thus the sync_file_range and the fsync won't cost much, or
> fsync will be honored, in which case the pre-flushing is helpful.

Possibly. I know that I do not know:-) The distance between the database
and the real hardware is so great in a VM that I think it may have any
effect: good, bad or none:-)

>> Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host
>> and it was as bad as Linux (namely the database and even the box was offline
>> for long minutes...), and if you can avoid that having to read back some
>> data may be not that bad a down payment.
>
> I don't see how that'd alleviate my fear.

I'm trying to mitigate your fears, not to alleviate them:-)

> Sure, the latency for many workloads will be better, but I don't see how
> that argument says anything about the reads?

It just says that there may be a compromise: better in some cases, possibly
not in others, because posix_fadvise does not really say what the database
would like to tell the OS. This is why I wrote such a large comment about
it in the source file in the first place.

> And we'll not just use this in cases it'd be beneficial...

I'm fine if it is off by default on some systems. If people want to avoid
write stalls they can use the option, but it may have adverse effects on
the tps in some cases; that's life. Not using the option also has adverse
effects in some cases, because you get write stalls... and currently you
do not have the choice, so it would be progress.

>> The issue is largely mitigated if the data is not removed from
>> shared_buffers, because the OS buffer is just a copy of already-held data.
>> What I would do on such systems is to increase shared_buffers and keep
>> flushing on, that is to count less on the system cache and more on postgres
>> own cache.
>
> That doesn't work that well for a bunch of reasons. For one it's
> completely non-adaptive. With the OS's page cache you can rely on free
> memory being used for caching *and* it be available should a query or
> another program need lots of memory.

Yep. I was thinking about a dedicated database server, not a shared one.

>> Overall, I'm not convinced that the practice of relying on the OS cache is a
>> good one, given what it does with it, at least on Linux.
>
> The alternatives aren't super realistic near-term though. Using direct
> IO efficiently on the set of operating systems we support is
> *hard*. [...]

Sure. This is not necessarily what I had in mind.

Currently pg "write"s stuff to the OS and then suddenly calls "fsync" out
of the blue, hoping that in between the OS will have done a good job with
the underlying hardware. This is pretty naive: the fsync generates write
storms during which the database is essentially offline. Trying to improve
these things is the motivation for this patch.
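
To make the contrast concrete, here is a minimal sketch in plain C (block
size, chunking and error handling are simplified; this is an illustration
of the pattern, not the patch's code):

    #define _GNU_SOURCE             /* for sync_file_range() on Linux */
    #include <fcntl.h>
    #include <unistd.h>

    /* Naive pattern: dirty pages pile up in the OS cache, then one
     * fsync() pushes them all out at once -> write storm. */
    static void checkpoint_naive(int fd, const char *page, size_t blksz, int nblocks)
    {
        for (int i = 0; i < nblocks; i++)
            (void) write(fd, page, blksz);
        (void) fsync(fd);
    }

    /* Continuous flushing: ask the kernel to start writeback of each
     * block shortly after writing it, so the final fsync() finds little
     * left to do and latency spikes stay much smaller.
     * (Assumes the file offset starts at 0.) */
    static void checkpoint_flushing(int fd, const char *page, size_t blksz, int nblocks)
    {
        for (int i = 0; i < nblocks; i++)
        {
            (void) write(fd, page, blksz);
            (void) sync_file_range(fd, (off_t) i * blksz, blksz,
                                   SYNC_FILE_RANGE_WRITE);
        }
        (void) fsync(fd);
    }

Of course the real code paces and batches this differently, but the
principle is the same.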

Now if you think about the bgwriter, it does pretty much the same, and
probably generates plenty of random I/O, because the underlying LRU/LFU
heuristics used to select buffers do not care about the file structure.
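
As an illustration only (the struct and field names below are made up, not
the patch's), sorting the blocks to be written by file and block number is
the kind of OS-friendly ordering I have in mind:

    #include <stdlib.h>

    /* Hypothetical descriptor of a dirty block waiting to be written. */
    typedef struct
    {
        unsigned int file_id;   /* which relation file */
        unsigned int block_no;  /* block number within that file */
    } DirtyBuf;

    /* Order writes by file then block, turning scattered random writes
     * into mostly-sequential ones the OS and the disk handle far better. */
    static int
    dirtybuf_cmp(const void *pa, const void *pb)
    {
        const DirtyBuf *a = pa, *b = pb;

        if (a->file_id != b->file_id)
            return (a->file_id < b->file_id) ? -1 : 1;
        if (a->block_no != b->block_no)
            return (a->block_no < b->block_no) ? -1 : 1;
        return 0;
    }

    /* ... then qsort(bufs, nbufs, sizeof(DirtyBuf), dirtybuf_cmp)
     * before issuing the writes. */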

So I think that to get good performance the database must take some control
over what the OS does with its writes. That does not mean that direct I/O
needs to be involved, although maybe it could be; this patch shows that it
is not needed to improve things.

>> Now, if someone could provide a dedicated box with posix_fadvise (say
>> FreeBSD, maybe others...) for testing that would allow to provide data
>> instead of speculating... and then maybe to decide to change its default
>> value.
>
> Testing, as an approximation, how it turns out to work on linux would be
> a good step.

Do you mean testing with posix_fadvise on Linux?

I did think about it, but the documented behavior of this call on Linux is
disappointing: if the buffer has already been written to disk, it is freed
by the OS; if not, nothing is done. Given that the flush is issued pretty
soon after the writes, mostly the buffer will not have been written to disk
yet, and the call would just be a no-op... So I concluded that there is no
point in trying that on Linux because it would have no effect other than
losing some time, IMO.
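
For reference, the call under discussion boils down to this (8 kB block
size used just as an example):

    #include <fcntl.h>

    /* Hint that an 8 kB block just handed to the OS will not be re-read
     * soon. On Linux, POSIX_FADV_DONTNEED drops clean cached pages in the
     * range but leaves still-dirty ones alone, which is why calling it
     * right after write() is mostly a no-op there. */
    static void
    hint_dontneed(int fd, off_t offset)
    {
        (void) posix_fadvise(fd, offset, 8192, POSIX_FADV_DONTNEED);
    }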

Really, a useful test would be on FreeBSD, where posix_fadvise does move
things to disk, although the actual offsets & length are ignored; I do not
think that would be a problem. I do not know about other systems and what
they do with posix_fadvise.

--
Fabien.

Attachment Content-Type Size
checkpoint-continuous-flush-9-a.patch text/x-diff 20.6 KB
checkpoint-continuous-flush-9-b.patch text/x-diff 28.9 KB
