Re: Checkpoint sync pause

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Checkpoint sync pause
Date: 2012-02-12 18:43:43
Message-ID: CAMkU=1xvJjsRusYu8WgfLRRCbocDCraV1A1PqawkJe-WZWNfEw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 7, 2012 at 1:22 PM, Greg Smith <gsmith(at)gregsmith(dot)com> wrote:
> On 02/03/2012 11:41 PM, Jeff Janes wrote:
>>>
>>> -The steady stream of backend writes that happen between checkpoints have
>>> filled up most of the OS write cache.  A look at /proc/meminfo shows
>>> around
>>> 2.5GB "Dirty:"
>>
>> "backend writes" includes bgwriter writes, right?
>
>
> Right.
>
>
>> Has using a newer kernel with dirty_background_bytes been tried, so it
>> could be set to a lower level?  If so, how did it do?  Or does it just
>> refuse to obey below the 5% level, as well?
>
>
> Trying to dip below 5% using dirty_background_bytes slows VACUUM down faster
> than it improves checkpoint latency.

Does it cause VACUUM to create latency for other processes (like the
checkpoint syncs do, by gumming up the IO for everyone) or does VACUUM
just slow down without affecting other tasks?

It seems to me that just lowering dirty_background_bytes (while not
also lowering dirty_bytes) should not cause the latter to happen, but
it seems like these kernel tunables never do exactly what they
advertise.
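
For reference, and assuming the newer-kernel route is still on the
table, setting this up is just a matter of writing a byte count into
the vm sysctl file while leaving dirty_bytes alone.  A small sketch of
that (the 64MB figure is only an example, not a recommendation, and it
needs root):

/* Illustrative sketch: lower the background writeback threshold while
 * deliberately leaving the foreground dirty_bytes/dirty_ratio limit
 * untouched. */
#include <stdio.h>
#include <stdlib.h>

static void
set_vm_knob(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (f == NULL)
    {
        perror(path);
        exit(1);
    }
    fprintf(f, "%s\n", value);
    if (fclose(f) != 0)
    {
        perror(path);
        exit(1);
    }
}

int
main(void)
{
    /* start background writeback once 64MB of dirty data accumulates */
    set_vm_knob("/proc/sys/vm/dirty_background_bytes", "67108864");
    /* note: /proc/sys/vm/dirty_bytes is intentionally not touched, so
     * foreground throttling keeps using the default dirty_ratio */
    return 0;
}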

This may not be relevant to the current situation, but I wonder if we
don't need a "vacuum_cost_page_dirty_seq" so that if the pages we are
dirtying are consecutive (or at least closely spaced) they cost less,
in anticipation that the eventual writes will be combined and thus
consume fewer IO resources. I would think it would be common for some
regions of a table to be intensely dirtied, and some to be lightly
dirtied (but still aggregating up to a considerable amount of random
IO). But the vacuum process might also need to be made more
"bursty", as even if it generates sequential dirty pages the IO system
might write them randomly anyway if there are too many delays
interspersed.
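
Purely as a sketch of the idea (none of these names exist in the
source; vacuum_cost_page_dirty_seq is invented and the numbers are
placeholders), the accounting change I'm imagining is roughly:

/* Hypothetical sketch, not PostgreSQL source: charge a reduced cost
 * when the page being dirtied immediately follows the previous one,
 * anticipating that the kernel will combine the eventual writes. */
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber  ((BlockNumber) 0xFFFFFFFF)

static int vacuum_cost_page_dirty = 20;      /* existing default */
static int vacuum_cost_page_dirty_seq = 5;   /* invented, cheaper */
static int vacuum_cost_balance = 0;
static BlockNumber last_dirtied_block = InvalidBlockNumber;

void
cost_account_page_dirty(BlockNumber blkno)
{
    if (last_dirtied_block != InvalidBlockNumber &&
        blkno == last_dirtied_block + 1)
        vacuum_cost_balance += vacuum_cost_page_dirty_seq;  /* sequential */
    else
        vacuum_cost_balance += vacuum_cost_page_dirty;      /* random */

    last_dirtied_block = blkno;
}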

> Since the sort of servers that have
> checkpoint issues are quite often ones that have VACUUM issues, too, that
> whole path doesn't seem very productive.  The one test I haven't tried yet
> is whether increasing the size of the VACUUM ring buffer might improve how
> well the server responds to a lower write cache.

I wouldn't expect this to help. It seems like it would hurt, as it
just leaves the data for even longer (however long it takes to
circumnavigate the ring buffer) before there is any possibility of it
getting written. I guess it does increase the chances that the dirty
pages will "accidentally" get written by the bgwriter rather than the
vacuum itself, but I doubt that that would be significant.

...
>>
>> Was the sorted checkpoint with an fsync after every file (real file,
>> not VFD) one of the changes you tried?
>
>
...
>
> I haven't had very good luck with sorting checkpoints at the PostgreSQL
> relation level on server-size systems.  There is a lot of sorting already
> happening at both the OS (~3GB) and BBWC (>=512MB) levels on this server.
>  My own tests on my smaller test server--with a scaled down OS (~750MB) and
> BBWC (256MB) cache--haven't ever validated sorting as a useful technique on
> top of that.  It's never bubbled up to being considered a likely win on the
> production one as a result.

Without sorted checkpoints (or some other fancier method) you have to
write out the entire pool before you can do any fsyncs. Or you have
to do multiple fsyncs of the same file, with at least one occurring
after the entire pool was written. With a sorted checkpoint, you can
start issuing once-only fsyncs very early in the checkpoint process.
I think that on large servers that would be the main benefit, not
more efficient IO as such. (On small servers I've seen sorted
checkpoints be much faster on shutdown checkpoints, but not on natural
checkpoints, and presumably this improvement *is* due to better
ordering).

On your servers, you need big delays between fsyncs and not between
writes (as they are buffered until the fsync). But in other
situations, people need the delays between the writes. By using
sorted checkpoints with fsyncs between each file, the delays between
writes are naturally delays between fsyncs as well. So I think the
benefit of using sorted checkpoints is that code to improve your
situation is less likely to degrade someone else's situation, without
having to introduce an extra layer of tunables.
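
To make that concrete, here is roughly the shape I have in mind, as a
standalone sketch rather than real checkpointer code (the types and
helpers are all stand-ins): sort the to-be-written buffers by file,
write each file's buffers, and fsync that file once as soon as its
last buffer has gone out, pacing after each fsync.

/* Sketch only; not PostgreSQL's checkpointer. */
#include <stdlib.h>

typedef struct DirtyBuf
{
    int file_id;     /* which relation file the page belongs to */
    int block_no;    /* block within that file */
} DirtyBuf;

static int
cmp_dirtybuf(const void *a, const void *b)
{
    const DirtyBuf *x = a;
    const DirtyBuf *y = b;

    if (x->file_id != y->file_id)
        return x->file_id - y->file_id;
    return x->block_no - y->block_no;
}

static void write_buffer(const DirtyBuf *buf) { (void) buf; } /* stand-in */
static void fsync_file(int file_id) { (void) file_id; }       /* stand-in */
static void checkpoint_delay(void) { }                        /* stand-in */

void
sorted_checkpoint(DirtyBuf *bufs, size_t nbufs)
{
    qsort(bufs, nbufs, sizeof(DirtyBuf), cmp_dirtybuf);

    for (size_t i = 0; i < nbufs; i++)
    {
        write_buffer(&bufs[i]);

        /* Last buffer of this file: fsync it now, exactly once, then
         * pace.  A delay here is simultaneously a delay between writes
         * of different files and a delay between fsyncs. */
        if (i + 1 == nbufs || bufs[i + 1].file_id != bufs[i].file_id)
        {
            fsync_file(bufs[i].file_id);
            checkpoint_delay();
        }
    }
}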

>
>> What I/O are they trying to do?  It seems like all your data is in RAM
>> (if not, I'm surprised you can get queries to run fast enough to
>> create this much dirty data).  So they probably aren't blocking on
>> reads which are being interfered with by all the attempted writes.
>
>
> Reads on infrequently read data.  Long tail again; even though caching is
> close to 100%, the occasional outlier client who wants some rarely accessed
> page with their personal data on it shows up.  Pollute the write caches
> badly enough, and what happens to reads mixed into there gets very fuzzy.
>  Depends on the exact mechanics of the I/O scheduler used in the kernel
> version deployed.

OK, but I would still think it is a minority of transactions which
need at least one of those infrequently read pages, and most do not. So
a few clients would freeze, but the rest should keep going until they
either try to execute a read themselves, or they run into a
heavyweight lock held by someone else who is blocked on a read. So if
1/1000 of all transactions need to make a disk read, but clients are
running at 100s of TPS, then I guess after a few tens of seconds all
clients will be blocked on reads and you will see total freeze up.
But it seems more likely to me that they are in fact freezing on
writes. Is there a way to directly observe what they are blocking on?
I wish "top" would separate %wait into read and write.

>
>
>> The current shared_buffer allocation method (or my misunderstanding of
>> it) reminds me of the joke about the guy who walks into his kitchen
>> with a cow-pie in his hand and tells his wife "Look what I almost
>> stepped in".  If you find a buffer that is usagecount=0 and unpinned,
>> but dirty, then why is it dirty?  It is likely to be dirty because the
>> background writer can't keep up.  And if the background writer can't
>> keep up, it is probably having trouble with writes blocking.  So, for
>> Pete's sake, don't try to write it out yourself!  If you can't find a
>> clean, reusable buffer in a reasonable number of attempts, I guess at
>> some point you need to punt and write one out.  But currently it grabs
>> the first unpinned usagecount=0 buffer it sees and writes it out if
>> dirty, without even checking if the next one might be clean.
>
>
> Don't forget that in the version deployed here, the background writer isn't
> running during the sync phase.

Oh, I had thought you had compiled your own custom workaround to
that. So much of the problem might go away with a new release and an
upgrade, as far as we know?

>  I think the direction you're talking about
> here circles back to "why doesn't the BGW just put things it finds clean
> onto the free list?",

I wouldn't put it that way, because to me the freelist is the code
located in freelist.c. The linked list is a freelist. But the clock
sweep is also a freelist, just implemented in a different way.

If the hypothetical BGW doesn't remove the entry from the buffer
mapping table and invalidate it when it adds to the linked list, then
we might pull a "free" buffer from the linked list and discover it is
not actually free. If we want to make it so that it does remove the
entry from the buffer mapping table (which doesn't seem like a good
idea to me) we could implement that just as well with the clock-sweep
as we could with the linked list.
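
To illustrate with a made-up sketch (this is not bufmgr code): whatever
puts buffers on such a list, the consumer still has to re-check each
entry before reusing it, and that re-check is the same test the clock
sweep already applies as it advances.

/* Hypothetical sketch: a buffer taken from a bgwriter-maintained
 * linked free list still has to be re-validated, because it may have
 * been looked up, pinned, or re-dirtied since it was listed. */
#include <stdbool.h>
#include <stddef.h>

typedef struct BufDesc
{
    int  refcount;       /* current pins */
    int  usage_count;    /* clock-sweep popularity counter */
    bool dirty;
    int  free_next;      /* link in the hypothetical free list, -1 = end */
} BufDesc;

BufDesc *
try_pop_free_buffer(BufDesc *pool, int *free_head)
{
    while (*free_head >= 0)
    {
        BufDesc *buf = &pool[*free_head];

        *free_head = buf->free_next;

        /* In real code this check would be made holding the buffer
         * header lock.  It is the same test the clock sweep applies. */
        if (buf->refcount == 0 && buf->usage_count == 0 && !buf->dirty)
            return buf;          /* genuinely free: safe to evict/reuse */

        /* Re-pinned, re-used, or re-dirtied since it was listed: the
         * "free" entry was stale, so skip it. */
    }
    return NULL;                 /* list empty: fall back to the sweep */
}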

I think the linked list is a bit of a red herring. Many of the
concepts people discuss implementing on the linked list could just as
easily be implemented with the clock sweep. And I've seen no evidence
at all that the clock sweep is the problem. The LWLock that protects
it can obviously be a problem, but that seems to be due to the overhead
of acquiring a contended lock, not the work done under the lock.
Reducing the lock-strength around this might be a good idea, but that
reduction could be done just as easily (and as far as I can tell, more
easily) with the clock sweep than the linked list.

> a direction which would make "nothing on the free
> list" a noteworthy event suggesting the BGW needs to run more often.

Isn't seeing a dirty unpinned usage_count==0 buffer in the clock sweep
just as noteworthy as seeing an empty linked list? From what I can
tell, you can't dirty a buffer without pinning it, you can't pin a
buffer without making usage_count>0, and we never decrement
usage_count on a pinned buffer. So, the only way to see a dirty
buffer that is unpinned and has zero usage_count is if another normal
backend saw it unpinned and decremented the count, which would have to
be a full clock sweep ago, and the bgwriter hasn't visited it since
then.

If our goal is to autotune the bgwriter_* parameters, then detecting
either an empty linked list or a dirty-but-usable buffer in the clock
sweep would be a good way to do that. But I think the bigger issue
is to assume that the bgwriter is already tuned as well as it can be,
and that beating on it further will not improve its morale. If the IO
write caches are all full, there is nothing bgwriter can do about it
by running more often. In that case, we can't really do anything
about the dirty pages it is leaving around our yard. But what we can
do is not pick up those little piles of toxic waste and bring them
into our living rooms. That is, don't try to write out the dirty page
in the foreground; instead, go looking for a clean one. We can evict
it without doing a write, and hopefully we can read in the replacement
either from OS cache, or from disk if reads are not as gummed up as
writes are.
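
In code terms, what I'm imagining is roughly the following sketch
(stand-in types, not a patch against freelist.c): let the sweep pass
over dirty zero-usage buffers for a bounded number of attempts,
preferring a clean victim that needs no write, and only do the
foreground write as a last resort.

/* Hypothetical "don't pick up the cow-pie" allocation policy. */
#include <stdbool.h>
#include <stddef.h>

typedef struct Buf
{
    int  refcount;      /* current pins */
    int  usage_count;   /* clock-sweep popularity counter */
    bool dirty;
} Buf;

static size_t clock_hand = 0;

/* stand-in for writing a dirty buffer out in the foreground */
static void flush_buffer(Buf *buf) { buf->dirty = false; }

Buf *
get_victim_buffer(Buf *pool, size_t npool, int max_dirty_skips)
{
    for (;;)
    {
        Buf *buf = &pool[clock_hand];

        clock_hand = (clock_hand + 1) % npool;

        if (buf->refcount != 0)
            continue;               /* pinned: never a candidate */

        if (buf->usage_count > 0)
        {
            buf->usage_count--;     /* ordinary clock-sweep decay */
            continue;
        }

        if (!buf->dirty)
            return buf;             /* clean victim: evict with no write */

        /* Dirty, unpinned, usage_count == 0: the cow-pie case.  Skip it
         * and keep hunting for a clean buffer; only write it in the
         * foreground once we run out of patience.  (Real code would
         * also give up after too many whole passes over the pool.) */
        if (--max_dirty_skips > 0)
            continue;

        flush_buffer(buf);          /* last resort: foreground write */
        return buf;
    }
}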

>> But I would think that pgbench can be configured to do that as well,
>> and would probably offer a wider array of other testers.  Of course, if
>> they have to copy and specify 30 different -f files, maybe getting
>> dbt-2 to install and run would be easier than that.  My attempts at
>> getting dbt-5 to work for me do not make me eager jump from pgbench to
>> try more other things.
>
>
> dbt-5 is a work in progress, known to be tricky to get going.  dbt-2 is
> mature enough that it was used for this sort of role in 8.3 development.
>  And it's even used by other database systems for similar testing.  It's the
> closest thing to an open-source standard for write-heavy workloads that we'll
> find here.

OK, thanks for the reassurance. I'll no longer be afraid to give it a
try if I get an opportunity.

Cheers,

Jeff
