Re: checkpointer continuous flushing

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-06-20 06:57:57
Message-ID: alpine.DEB.2.10.1506200817400.31742@sto
Lists: pgsql-hackers

Hello Andres,

>>> - Move fsync as early as possible, suggested by Andres Freund?
>>> My opinion is that this should be left out for the nonce.
> "for the nonce" - what does that mean?

Nonce \Nonce\ (n[o^]ns), n. [For the nonce, OE. for the nones, ...
{for the nonce}, i. e. for the present time.

> I'm doubtful that it's a good idea to separate this out, if you did.

Actually I did, because, as explained in another mail, the fsync time
reported in the logs when the other options are activated is essentially
zero, so moving the fsync earlier would not bring significant improvements
on these runs. Also, the patch changes enough things as it is.

So this is an evidence-based decision.

I also agree that it seems interesting in principle and should be
beneficial in some cases, but I would rather keep it on a TODO list,
together with trying to do better things in the bgwriter, and focus
on the current proposal, which already significantly changes the
checkpointer throttling logic.

>> - as version 2: checkpoint buffer sorting based on a 2007 patch by
>> Takahiro Itagaki but with a smaller and static buffer allocated once.
>> Also, sorting is done by chunks of 131072 pages in the current version,
>> with a guc to change this value.
> I think it's a really bad idea to do this in chunks.

The small problem I see is that for a very large setting there could be
several seconds or even minutes of sorting, which may or may not be
desirable, so having some control over that seems a good idea.

Another argument is that Tom said he wanted that:-)

In practice the value can be set high enough that the buffers are nearly
always sorted in one go. Maybe the value "0" could be made special, used
to trigger this behavior systematically, and be the default.
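
To illustrate the trade-off being discussed, here is a rough sketch in
Python (not the patch's C code). The names sort_in_chunks, chunk_size and
file_switches are hypothetical, and the "0 means one full sort" convention
is only the proposal made above, not an existing setting.

```python
def sort_in_chunks(dirty, chunk_size=0):
    """Sort dirty-buffer tags chunk by chunk.

    dirty: list of (file_id, block_no) tuples in buffer-pool order.
    chunk_size: hypothetical setting; 0 means sort everything in one go.
    """
    if chunk_size <= 0:
        chunk_size = max(len(dirty), 1)
    out = []
    for start in range(0, len(dirty), chunk_size):
        # Each chunk is sorted independently of the others.
        out.extend(sorted(dirty[start:start + chunk_size]))
    return out


def file_switches(order):
    """Count how often consecutive writes target a different file."""
    return sum(1 for a, b in zip(order, order[1:]) if a[0] != b[0])


# Two files whose dirty pages alternate in the buffer pool.
dirty = [(i % 2, i) for i in range(8)]
full = sort_in_chunks(dirty, 0)    # one global sort
small = sort_in_chunks(dirty, 4)   # two chunks of four pages
```

With one global sort each file is written in a single run, while small
chunks revisit the same file once per chunk. When chunk_size is at least
the number of dirty buffers, both calls give the same order.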

> That'll mean we'll frequently uselessly cause repetitive random IO,

This is not an issue if the chunks are large enough, and anyway the guc
allows changing the behavior as desired. As I said, keeping some control
seems a good idea, and the "full sorting" can be made the default.

> often interleaved. That pattern is horrible for SSDs too. We should
> always try to do this at once, and only fail back to using less memory
> if we couldn't allocate everything.

The memory is needed anyway in order to avoid a duplicated or
significantly heavier implementation of the throttling loop. It is
allocated once, on the first checkpoint. The allocation could be moved to
the checkpointer initialization if this is a concern. The memory needed is
one int per buffer, which is less than in the 2007 patch.
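
As a back-of-the-envelope check of "one int per buffer", here is a small
Python sketch. It assumes the default 8 kB block size and a 4-byte int;
the function name sort_array_bytes is made up for the illustration.

```python
BLCKSZ = 8192      # assumed: default PostgreSQL block size in bytes
SIZEOF_INT = 4     # assumed: size of a C int in bytes


def sort_array_bytes(shared_buffers_bytes):
    """Bytes needed for one int per shared buffer."""
    nbuffers = shared_buffers_bytes // BLCKSZ
    return nbuffers * SIZEOF_INT


GIB = 1 << 30
one_gib = sort_array_bytes(GIB)          # the "tiny" setting, 1GB
sixteen_gib = sort_array_bytes(16 * GIB)
```

Under these assumptions shared_buffers=1GB means 131072 buffers, so the
array takes 512 kB, and a 16 GB pool would need 8 MB. The 131072-page
chunk mentioned above thus corresponds to 1 GB worth of buffers.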

>> . tiny: scale=10 shared_buffers=1GB checkpoint_timeout=30s time=6400s
> It'd be interesting to see numbers for tiny, without the overly small
> checkpoint timeout value. 30s is below the OS's writeback time.

The point of tiny was to trigger a lot of checkpoints. The size is pretty
ridiculous anyway, as "tiny" implies. I think I did some tests on other
versions of the patch, with a longer checkpoint_timeout on a pretty small
database, that showed a smaller benefit from the options, as one would
expect. I'll try to re-run some.

> So you've not run things at more serious concurrency, that'd be
> interesting to see.

I do not have a box available for "serious concurrency".

> I'd also like to see concurrent workloads with synchronous_commit=off -
> I've seen absolutely horrible latency behaviour for that, and I'm hoping
> this will help. It's also a good way to simulate faster hardware than
> you have.

> It's also curious that sorting is detrimental for full speed 'tiny'.


>> With SSD both options would probably have limited benefit.
> I doubt that. Small random writes have bad consequences for wear
> leveling. You might not notice that with a short tests - again, I doubt
> it - but it'll definitely become visible over time.

Possibly. Testing such effects does not seem easy, though. At least I have
not seen "write stalls" on SSD, which is my primary concern.

