Re: checkpointer continuous flushing

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2015-06-22 05:51:39
Message-ID: alpine.DEB.2.10.1506220713150.16123@sto
Lists: pgsql-hackers


Hello Andres,

>> So this is an evidence-based decision.
>
> Meh. You're testing on low concurrency.

Well, I'm just testing on the available box.

I do not see the link between high concurrency and whether moving fsync as
early as possible would have a large performance impact. I think it might
be interesting if the bgwriter is doing a lot of writes, but I'm not sure
under which configuration & load that would be the case.

>>> I think it's a really bad idea to do this in chunks.
>>
>> The small problem I see is that for a very large setting there could be
>> several seconds or even minutes of sorting, which may or may not be
>> desirable, so having some control on that seems a good idea.
>
> If the sorting of the dirty blocks alone takes minutes, it'll never
> finish writing that many buffers out. That's a utterly bogus argument.

Well, if in the future you have 8 TB of memory (I saw a 512 GB memory
server just a few weeks ago) and set shared_buffers=2TB, then if I'm not
mistaken you may have up to 256 million 8 kB buffers to checkpoint in the
worst case (2 TB / 8 kB = 268,435,456 buffers). Whether that many buffers
can be written out in reasonable time then really depends on the I/O
subsystem attached to the box, but if you bought 8 TB of RAM you probably
have pretty good I/O hardware as well.
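For the record, a back-of-envelope computation of the figures above (the
8 kB page size and the one-int-per-buffer sort array are assumptions
matching the default BLCKSZ and the patch as submitted):

#include <stdio.h>

int main(void)
{
    long long shared_buffers = 2LL << 40;   /* 2 TB                   */
    long long blcksz = 8 * 1024;            /* default 8 kB page size */
    long long nbuffers = shared_buffers / blcksz;

    printf("%lld buffers (~256 million)\n", nbuffers);
    printf("%lld MB for a sort array of 4-byte buffer ids\n",
           nbuffers * 4 / (1024 * 1024));
    return 0;
}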

>> Another argument is that Tom said he wanted that:-)
>
> I don't think he said that when we discussed this last.

That is what I was recalling when I wrote this sentence:

http://www.postgresql.org/message-id/6599.1409421040@sss.pgh.pa.us

But it had more to do with memory-allocation management.

>> In practice the value can be set at a high value so that it is nearly always
>> sorted in one go. Maybe value "0" could be made special and used to trigger
>> this behavior systematically, and be the default.
>
> You're just making things too complicated.

ISTM that it is not really complicated, but anyway it is easy to change
the checkpoint_sort stuff to a boolean.
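If it goes the boolean way, the guc.c entry would look something like the
following (just a sketch: the name, group and wording are mine, not
necessarily what the final patch would use):

static bool checkpoint_sort = true;

/* entry for the ConfigureNamesBool[] table in guc.c */
{
    {"checkpoint_sort", PGC_SIGHUP, WAL_CHECKPOINTS,
        gettext_noop("Sorts buffers by file location before the checkpointer writes them."),
        NULL
    },
    &checkpoint_sort,
    true,
    NULL, NULL, NULL
},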

In the reported performance tests there is usually just one chunk anyway,
sometimes two, so they already give an idea of the overall performance effect.

>> This is not an issue if the chunks are large enough, and anyway the guc
>> allows to change the behavior as desired.
>
> I don't think this is true. If two consecutive blocks are dirty, but you
> sync them in two different chunks, you *always* will cause additional
> random IO.

I think that the amount of such additional random I/O should be small if the
chunks are large: the extra seeks are roughly bounded by one per file per
chunk boundary, so the marginal benefit of sorting larger and larger chunks
is decreasing.

> Either the drive will have to skip the write for that block,
> or the os will prefetch the data. More importantly with SSDs it voids
> the wear leveling advantages.

Possibly. I must admit that I do not know the details of the wear leveling
done by SSD firmware.

>>> often interleaved. That pattern is horrible for SSDs too. We should always
>>> try to do this at once, and only fail back to using less memory if we
>>> couldn't allocate everything.
>>
>> The memory is needed anyway in order to avoid a double or significantly more
>> heavy implementation for the throttling loop. It is allocated once on the
>> first checkpoint. The allocation could be moved to the checkpointer
>> initialization if this is a concern. The memory needed is one int per
>> buffer, which is smaller than the 2007 patch.
>
> There's a reason the 2007 patch (and my revision of it last year) did
> what it did. You can't just access buffer descriptors without
> locking.

I really think that you can, because the sorting is only "advisory": the
checkpointer will work fine even if the sort order is wrong or the sort is
not done at all, which is the situation today when the checkpointer writes
buffers. The only condition is that a buffer's contents must not be moved
elsewhere while it keeps its "to write in this checkpoint" flag, but that is
also required for the current checkpointer code to work.
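To make that concrete, the kind of comparator I have in mind reads the
descriptor fields without any lock (the names below follow
storage/buf_internals.h, but take this as a sketch rather than the exact
patch code):

/* order buffer ids by on-disk location; the fields are read without the
 * buffer header lock, which is acceptable because the resulting order is
 * only advisory: a stale read just means a slightly less good order */
static int
ckpt_buf_cmp(const void *pa, const void *pb)
{
    BufferDesc *a = GetBufferDescriptor(*(const int *) pa);
    BufferDesc *b = GetBufferDescriptor(*(const int *) pb);

    if (a->tag.rnode.spcNode != b->tag.rnode.spcNode)
        return a->tag.rnode.spcNode < b->tag.rnode.spcNode ? -1 : 1;
    if (a->tag.rnode.dbNode != b->tag.rnode.dbNode)
        return a->tag.rnode.dbNode < b->tag.rnode.dbNode ? -1 : 1;
    if (a->tag.rnode.relNode != b->tag.rnode.relNode)
        return a->tag.rnode.relNode < b->tag.rnode.relNode ? -1 : 1;
    if (a->tag.forkNum != b->tag.forkNum)
        return a->tag.forkNum < b->tag.forkNum ? -1 : 1;
    if (a->tag.blockNum != b->tag.blockNum)
        return a->tag.blockNum < b->tag.blockNum ? -1 : 1;
    return 0;
}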

Moreover, this trick already pre-dates the patch I submitted: some tests are
done without locking, but the actual "buffer write" takes the lock and skips
the write if the earlier unlocked test turns out to be wrong, as described in
comments in the code.
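This is, simplified, the pattern already used by BufferSync() and
SyncOneBuffer() in bufmgr.c, and which a sorted write order keeps relying on
(9.5-era field names, sketched from memory):

BufferDesc *bufHdr = GetBufferDescriptor(buf_id);

/* unlocked peek: may be stale, only used to decide whether to try a write */
if (bufHdr->flags & BM_CHECKPOINT_NEEDED)
{
    /*
     * SyncOneBuffer() takes the buffer header lock, re-checks that the
     * buffer is still valid and dirty, and does nothing otherwise, so a
     * stale flag -- or an imperfect sort order -- is harmless.
     */
    if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)
        num_written++;
}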

> Besides, causing additional cacheline bouncing during the
> sorting process is a bad idea.

Hmmm. Avoiding that would mean copying the sort keys out of the descriptors,
i.e. storing (buf_id, relation, forknum, offset) instead of just buf_id,
which multiplies the memory required by 3 or 4, and I understood that memory
was a concern.
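For illustration, the two layouts being compared would be roughly the
following (the struct and its name are mine, sizes indicative):

/* (a) what the submitted patch sorts: one buffer id per dirty buffer,
 *     the comparator reading the location from the shared descriptors */
int        *to_sort;             /* ~4 bytes per buffer */

/* (b) copying the sort keys out of the descriptors first, which avoids
 *     touching the shared descriptors while sorting but costs ~4x more */
typedef struct CkptSortEntry
{
    Oid         relNode;         /* relation             */
    ForkNumber  forkNum;         /* fork                 */
    BlockNumber blockNum;        /* offset in the fork   */
    int         buf_id;          /* buffer to write      */
} CkptSortEntry;                 /* ~16 bytes per buffer */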

Moreover, once the sorting process has pulled into its cache the lines that
hold the sort keys from the buffer descriptors, I think it should be pretty
much okay. Incidentally, those lines would probably have been brought into
cache anyway by the scan that collects the dirty buffers. Also, I do not
think that the sorting time for 128000 buffers, and the possible cache
misses, is a big issue, but I do not have measurements to back that up. I
could try to collect some data about that.
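Something as simple as this standalone test (plain qsort on 128000 random
ints, nothing PostgreSQL-specific) should already give an order of magnitude
for the pure sorting cost:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp_int(const void *a, const void *b)
{
    int ia = *(const int *) a, ib = *(const int *) b;
    return (ia > ib) - (ia < ib);
}

int main(void)
{
    int     n = 128000;          /* ~1 GB of 8 kB shared buffers */
    int    *keys = malloc(n * sizeof(int));
    clock_t start;

    for (int i = 0; i < n; i++)
        keys[i] = rand();

    start = clock();
    qsort(keys, n, sizeof(int), cmp_int);
    printf("sorted %d keys in %.1f ms\n", n,
           1000.0 * (clock() - start) / CLOCKS_PER_SEC);

    free(keys);
    return 0;
}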

--
Fabien.
