Re: checkpointer continuous flushing - V18

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing - V18
Date: 2016-02-21 07:26:28
Message-ID: alpine.DEB.2.10.1602210746250.3927@sto
Lists: pgsql-hackers


Hello Andres,

>> In some previous version I think a warning was shown if the feature was
>> requested but not available.
>
> I think we should either silently ignore it, or error out. Warnings
> somewhere in the background aren't particularly meaningful.

I like "ignoring with a warning" in the log file, because when things do
not behave as expected that is where I'll be looking. I do not think that
it should error out.
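
To be concrete, what I have in mind is something along the lines of the
sketch below; the guard, the variable name and the wording are mine, just
to illustrate "log and ignore", not taken from the patch:

#ifndef HAVE_SYNC_FILE_RANGE
    /*
     * Hypothetical sketch: if flushing was requested but the platform
     * cannot do it, log it and do nothing, instead of erroring out.
     */
    if (flush_after > 0)
        ereport(LOG,
                (errmsg("flushing of dirty data is not supported on this platform, ignoring \"*_flush_after\"")));
#endif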

>> The sgml documentation about "*_flush_after" configuration parameter
>> talks about bytes, but the actual unit should be buffers.
>
> The unit actually is buffers, but you can configure it using
> bytes. We've done the same for other GUCs (shared_buffers, wal_buffers,
> ...). Referring to bytes is easier because you don't have to explain that
> it depends on compilation settings how much data it actually is and
> such.

So I understand that it works with kB as well. Still, I do not think that
it would need much explanation to say that it is a number of pages, and a
number of pages is meaningful precisely because it is, ultimately, the
number of IO requests to be coalesced.
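
As a standalone illustration (not pg code, and assuming the default 8 kB
BLCKSZ), the byte value is really just a page count in disguise:

#include <stdio.h>

#define BLCKSZ 8192                     /* assumed default page size */

int
main(void)
{
    long    flush_after_bytes = 256 * 1024;     /* e.g. "256kB" in the conf */
    long    flush_after_pages = flush_after_bytes / BLCKSZ;

    /* the page count is the number of IO requests that may be coalesced */
    printf("%ld bytes -> %ld pages of %d bytes\n",
           flush_after_bytes, flush_after_pages, BLCKSZ);
    return 0;
}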

>> In the discussion in the wal section, I'm not sure about the effect of
>> setting writebacks on SSD, [...]
>
> Yea, that paragraph needs some editing. I think we should basically
> remove that last sentence.

Ok, fine with me. Does that mean that flushing has a significant positive
impact on SSDs in your tests?

>> However it does not address the point that bgwriter and backends
>> basically issue random writes, [...]
>
> The benefit is primarily that you don't collect large amounts of dirty
> buffers in the kernel page cache. In most cases the kernel will not be
> able to coalesce these writes either... I've measured *massive*
> performance latency differences for workloads that are bigger than
> shared buffers - because suddenly bgwriter / backends do the majority of
> the writes. Flushing in the checkpoint quite possibly makes nearly no
> difference in such cases.

So I understand that there is a positive impact under some load. Good!

>> Maybe the merging strategy could be more aggressive than just strict
>> neighbors?
>
> I don't think so. If you flush more than neighbouring writes you'll
> often end up flushing buffers dirtied by another backend, causing
> additional stalls.

Ok. Maybe the neighbor definition could be relaxed just a little bit so
that small holes are jumped over, but not large ones? If there are only a
few pages in between, even if they were written by another process, then
writing them together should still be better? Well, this can wait for a
clear case, because hopefully the OS will re-coalesce them behind the
scenes anyway.
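
To make the idea concrete, here is a standalone sketch with a made-up gap
limit; the actual patch, as I understand it, only merges strictly adjacent
blocks:

#include <stdio.h>

#define MAX_GAP 2               /* hypothetical tolerated hole, in pages */

int
main(void)
{
    int     blocks[] = {10, 11, 12, 15, 16, 40, 41};    /* sorted dirty blocks */
    int     nblocks = sizeof(blocks) / sizeof(blocks[0]);
    int     start = blocks[0];
    int     end = blocks[0];

    for (int i = 1; i <= nblocks; i++)
    {
        if (i < nblocks && blocks[i] <= end + 1 + MAX_GAP)
            end = blocks[i];            /* close enough: extend current range */
        else
        {
            printf("writeback blocks %d..%d\n", start, end);
            if (i < nblocks)
                start = end = blocks[i];
        }
    }
    return 0;
}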

>> struct WritebackContext: keeping a pointer to guc variables is a kind of
>> trick, I think it deserves a comment.
>
> It has, it's just in WritebackContextInit(). Can duplicate it.

I missed it; I expected something in the struct definition. Do not
duplicate it, but cross-reference it?
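
For instance, a short comment in the struct itself pointing to the longer
one (the fields below are recalled from the patch and may be slightly
off):

typedef struct WritebackContext
{
    /*
     * Pointer to the GUC variable (*_flush_after) that limits the number
     * of accumulated writebacks; see WritebackContextInit() for why a
     * pointer is kept rather than a copy.
     */
    int        *max_pending;

    int         nr_pending;         /* writebacks accumulated so far */
    PendingWriteback pending_writebacks[WRITEBACK_MAX_PENDING_FLUSHES];
} WritebackContext;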

>> IssuePendingWritebacks: I understand that qsort is needed "again"
>> because when balancing writes over tablespaces they may be intermixed.
>
> Also because the infrastructure is used for more than checkpoint
> writes. There's absolutely no ordering guarantees there.

Yep, but there is not much benefit to expect from a few dozen random pages
either.
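
For what it is worth, a standalone sketch of the ordering involved (the
struct and the data are made up): sorting by (tablespace, relation, block)
only pays off when some of the pending pages happen to be neighbors in the
same file.

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    unsigned    tablespace;
    unsigned    relation;
    unsigned    block;
} PendingReq;

static int
req_cmp(const void *a, const void *b)
{
    const PendingReq *ra = a;
    const PendingReq *rb = b;

    if (ra->tablespace != rb->tablespace)
        return ra->tablespace < rb->tablespace ? -1 : 1;
    if (ra->relation != rb->relation)
        return ra->relation < rb->relation ? -1 : 1;
    if (ra->block != rb->block)
        return ra->block < rb->block ? -1 : 1;
    return 0;
}

int
main(void)
{
    PendingReq  reqs[] = {{2, 7, 4}, {1, 3, 9}, {1, 3, 8}, {2, 7, 3}};
    int         n = sizeof(reqs) / sizeof(reqs[0]);

    qsort(reqs, n, sizeof(PendingReq), req_cmp);

    for (int i = 0; i < n; i++)
        printf("ts=%u rel=%u blk=%u\n",
               reqs[i].tablespace, reqs[i].relation, reqs[i].block);
    return 0;
}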

>> [...] I do think that this whole writeback logic really does make sense
>> *per table space*,
>
> Leads to less regular IO, because if your tablespaces are evenly sized
> (somewhat common) you'll sometimes end up issuing sync_file_range's
> shortly after each other. For latency outside checkpoints it's
> important to control the total amount of dirty buffers, and that's
> obviously independent of tablespaces.

I do not understand/buy this argument.

The underlying IO queue is per device, and tablespaces should be per
device as well (otherwise what is the point?), so you should want to
coalesce and "writeback" pages per device as well. sync_file_range calls
on distinct devices can probably be issued in any order and should not
interfere with one another.

If you use just one context, then the more tablespaces there are, the
smaller the performance gain, because there is less and less aggregation,
and thus fewer sequential writes, per device.

So for me there should really be one context per tablespace. That suggests
a hashtable or some other structure to keep and retrieve them, which would
not be that bad, and I think it is what is needed.
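
A rough sketch of what I have in mind, assuming a dynahash keyed by the
tablespace oid; all the names below are mine, not from the patch:

/*
 * Sketch only: one WritebackContext per tablespace, so that coalescing
 * and sync_file_range() calls stay per device.
 */
typedef struct TablespaceWbEntry
{
    Oid             tsoid;          /* hash key: tablespace oid */
    WritebackContext wb_context;    /* pending writebacks for this tablespace */
} TablespaceWbEntry;

static HTAB *tswb_hash = NULL;

static WritebackContext *
get_tablespace_wb_context(Oid tsoid, int *flush_after)
{
    TablespaceWbEntry *entry;
    bool        found;

    if (tswb_hash == NULL)
    {
        HASHCTL     ctl;

        memset(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(Oid);
        ctl.entrysize = sizeof(TablespaceWbEntry);
        tswb_hash = hash_create("per-tablespace writeback contexts", 16,
                                &ctl, HASH_ELEM | HASH_BLOBS);
    }

    entry = hash_search(tswb_hash, &tsoid, HASH_ENTER, &found);
    if (!found)
        WritebackContextInit(&entry->wb_context, flush_after);
    return &entry->wb_context;
}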

>> For the checkpointer, a key aspect is that the scheduling process goes
>> to sleep from time to time, and this sleep time looked like a great
>> opportunity to do this kind of flushing. You choose not to take advantage
>> of the behavior, why?
>
> Several reasons: Most importantly there's absolutely no guarantee that
> you'll ever end up sleeping, it's quite common to happen only seldomly.

Well, that would be a situation in which pg is completely unresponsive.
What is more, this behavior *makes* pg unresponsive.

> If you're bottlenecked on IO, you can end up being behind all the time.

Hopefully sorting & flushing should improve this situation a lot.

> But even then you don't want to cause massive latency spikes
> due to gigabytes of dirty data - a slower checkpoint is a much better
> choice. It'd make the writeback infrastructure less generic.

Sure. It would be sufficient to have a call that asks for the accumulated
writebacks to be issued regardless of how many are in the queue; it does
not need to change the infrastructure.

Also, I think that such a call would make sense at the end of the
checkpoint.
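
Concretely, I am thinking of something like the line below at the end of
BufferSync(), and possibly just before the checkpointer sleeps (assuming
the local writeback context is called wb_context there; the function name
is the one I remember from the patch):

/*
 * Sketch: issue whatever writebacks have accumulated, regardless of
 * whether the *_flush_after threshold was reached.
 */
IssuePendingWritebacks(&wb_context);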

> I also don't really believe it helps that much, although that's a
> complex argument to make.

Yep. My thinking is that doing things in the sleeping interval does not
interfere with the checkpointer scheduling, so it is less likely to go
wrong and fall behind.

--
Fabien.
