Re: Checkpoint sync pause

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Checkpoint sync pause
Date: 2012-02-04 04:41:54
Message-ID: CAMkU=1yFju3A56ad6E-mh7-D0d7nhOn3psDasKeb=jjZ-pGqXw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 16, 2012 at 5:59 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On 01/16/2012 11:00 AM, Robert Haas wrote:
>>
>> Also, I am still struggling with what the right benchmarking methodology
>> even is to judge whether
>> any patch in this area "works".  Can you provide more details about
>> your test setup?
>
>
> The "test" setup is a production server with a few hundred users at peak
> workload, reading and writing to the database.  Each RAID controller (couple
> of them with their own tablespaces) has either 512MB or 1GB of
> battery-backed write cache.  The setup that leads to the bad situation
> happens like this:
>
> -The steady stream of backend writes that happen between checkpoints have
> filled up most of the OS write cache.  A look at /proc/meminfo shows around
> 2.5GB "Dirty:"

"backend writes" includes bgwriter writes, right?

>
> -Since we have shared_buffers set to 512MB to try and keep checkpoint storms
> from being too bad, there might be 300MB of dirty pages involved in the
> checkpoint.  The write phase dumps this all into Linux's cache.  There's now
> closer to 3GB of dirty data there.  @64GB of RAM, this is still only 4.7%
> though--just below the effective lower range for dirty_background_ratio.

Has a newer kernel with dirty_background_bytes been tried, so the
threshold could be set to a lower level? If so, how did it do? Or does
it just refuse to obey below the 5% level as well?
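
Something like this is what I had in mind -- illustrative values only,
and only on kernels new enough (>= 2.6.29) to have the byte-based
knobs; as I understand it, setting the _bytes form zeroes out the
corresponding _ratio form:

# ask for background writeback at ~256MB of dirty data instead of 5% of 64GB
sysctl -w vm.dirty_background_bytes=268435456
# and a correspondingly lower hard limit (1GB here, purely illustrative)
sysctl -w vm.dirty_bytes=1073741824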

>  Linux is perfectly content to let it all sit there.
>
> -Sync phase begins.  Between absorption and the new checkpoint writes, there
> are >300 segments to sync waiting here.

If there is 3GB of dirty data spread over >300 segments, and each
segment is about full-sized (1GB), then on average <1% of each segment
is dirty?
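
To check my arithmetic with the numbers quoted above:

echo "scale=1; 3 * 1024 / 300" | bc      # ~10.2 MB of dirty data per segment
echo "scale=1; 10.2 * 100 / 1024" | bc   # ~0.9, so roughly 1% of a 1GB segment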

If that analysis holds, then it seems like there is simply an awful lot
of data that has to be written randomly, no matter how clever the
re-ordering is. In other words, it is not that a harried or panicked
kernel or RAID controller is failing to do good re-ordering when it has
opportunities to; it is just that you dirty your data too randomly for
substantial reordering to be possible even under ideal conditions.

Does the BBWC, once given an fsync command and reporting success,
write out those blocks forthwith, or does it lollygag around like the
kernel (under non-fsync) does? If it is waiting around for
write-combining opportunities that will never actually materialize in
sufficient quantities to make up for the wait, how do we get it to stop?

Was the sorted checkpoint with an fsync after every file (real file,
not VFD) one of the changes you tried?

> -The first few syncs force data out of Linux's cache and into the BBWC.
>  Some of these return almost instantly.  Others block for a moderate number
> of seconds.  That's not necessarily a showstopper, on XFS at least.  So long
> as the checkpointer is not being given all of the I/O in the system, the
> fact that it's stuck waiting for a sync doesn't mean the server is
> unresponsive to the needs of other backends.  Early data might look like
> this:
>
> DEBUG:  Sync #1 time=21.969000 gap=0.000000 msec
> DEBUG:  Sync #2 time=40.378000 gap=0.000000 msec
> DEBUG:  Sync #3 time=12574.224000 gap=3007.614000 msec
> DEBUG:  Sync #4 time=91.385000 gap=2433.719000 msec
> DEBUG:  Sync #5 time=2119.122000 gap=2836.741000 msec
> DEBUG:  Sync #6 time=67.134000 gap=2840.791000 msec
> DEBUG:  Sync #7 time=62.005000 gap=3004.823000 msec
> DEBUG:  Sync #8 time=0.004000 gap=2818.031000 msec
> DEBUG:  Sync #9 time=0.006000 gap=3012.026000 msec
> DEBUG:  Sync #10 time=302.750000 gap=3003.958000 msec

Syncs 3 and 5 kind of surprise me. It seems like the times should be
more bimodal. If the cache is already full, why doesn't the system
promptly collapse, like it does later? And if it is not, why would it
take 12 seconds, or even 2 seconds? Or is this just evidence that the
gaps you are inserting are partially, but not completely, effective?

>
> [Here 'gap' is a precise measurement of how close the sync pause feature is
> working, with it set to 3 seconds.  This is from an earlier version of this
> patch.  All the timing issues I used to measure went away in the current
> implementation because it doesn't have to worry about doing background
> writer LRU work anymore, with the checkpointer split out]
>
> But after a few hundred of these, every downstream cache is filled up.  The
> result is seeing more really ugly sync times, like #164 here:
>
> DEBUG:  Sync #160 time=1147.386000 gap=2801.047000 msec
> DEBUG:  Sync #161 time=0.004000 gap=4075.115000 msec
> DEBUG:  Sync #162 time=0.005000 gap=2943.966000 msec
> DEBUG:  Sync #163 time=962.769000 gap=3003.906000 msec
> DEBUG:  Sync #164 time=45125.991000 gap=3033.228000 msec
> DEBUG:  Sync #165 time=4.031000 gap=2818.013000 msec
> DEBUG:  Sync #166 time=212.537000 gap=3039.979000 msec
> DEBUG:  Sync #167 time=0.005000 gap=2820.023000 msec
> ...
> DEBUG:  Sync #355 time=2.550000 gap=2806.425000 msec
> LOG:  Sync 355 files longest=45125.991000 msec average=1276.177977 msec
>
> At the same time #164 is happening, that 45 second long window, a pile of
> clients will get stuck where they can't do any I/O.

What I/O are they trying to do? It seems like all your data is in RAM
(if not, I'm surprised you can get queries to run fast enough to
create this much dirty data). So they probably aren't blocking on
reads which are being interfered with by all the attempted writes.
Your WAL is probably on a separate controller which is not impacted
(unless the impact is at the kernel level). What other writes
would there be? The only major ones I can think of are backend writes
of shared_buffers, or writes of CLOG buffers, because you can't read in
a new one until an existing one is cleaned.

The current shared_buffer allocation method (or my misunderstanding of
it) reminds me of the joke about the guy who walks into his kitchen
with a cow-pie in his hand and tells his wife "Look what I almost
stepped in". If you find a buffer that is usagecount=0 and unpinned,
but dirty, then why is it dirty? It is likely to be dirty because the
background writer can't keep up. And if the background writer can't
keep up, it is probably having trouble with writes blocking. So, for
Pete's sake, don't try to write it out yourself! If you can't find a
clean, reusable buffer in a reasonable number of attempts, I guess at
some point you need to punt and write one out. But currently it grabs
the first unpinned usagecount=0 buffer it sees and writes it out if
dirty, without even checking if the next one might be clean.

> The RAID controller
> that used to have a useful mix of data is now completely filled with >=512MB
> of random writes.  It's now failing to write as fast as new data is coming
> in.  Eventually that leads to pressure building up in Linux's cache.  Now
> you're in the bad place:  dirty_background_ratio is crossed, Linux is now
> worried about spooling all cached writes to disk as fast as it can, the
> checkpointer is sync'ing its own important data to disk as fast as it can
> too, and all caches are inefficient because they're full.
>
> To recreate a scenario like this, I've realized the benchmark needs to have
> a couple of characteristics:
>
> -It has to focus on transaction latency instead of throughput.  We know that
> doing syncs more often will lower throughput due to reduced reordering etc.

One option for pgbench I've contemplated was better latency reporting.
I don't really want to have to mine very large log files (and just
writing them out can produce I/O that competes with the I/O you actually
care about, if you don't have a lot of controllers around to isolate
everything). I'd like to see a report every 5 seconds of what the
longest latency was over the last 5 seconds.
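
In the meantime, something like this against the per-transaction logs
from -l gets close (a rough sketch, assuming the log line format
"client_id xact_no latency_us file_no time_epoch time_us" and the
default pgbench_log.* file names):

# worst latency (in ms) seen in each 5-second bucket
awk '{ b = int($5/5)*5; if ($3 > max[b]) max[b] = $3 }
     END { for (b in max) print b, max[b]/1000 " ms" }' pgbench_log.* | sort -n

Of course that still means writing the big log files, which is the part
I'd rather avoid.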

Doing syncs more often would only lower throughput through reduced
reordering if there were substantial opportunities for reordering to
start with. If you are dirtying 300 segments but only 3GB of data,
then unless a lot of those segments are far from full you might not
have that much opportunity for reordering to start with.

Also, what limits the amount of work that needs to get done? If you
make a change that decreases throughput but also decreases latency,
then something else has got to give. If the available throughput is
less than the work that needs to get done, either connections and
latency will go up until the system will completely falls over, or
some attempts to connect will get refused. The only other possibility
I see is that some clients will get sick of the latency and decide
that their particular chunk of work doesn't need to get done at this
particular time after all--a decision often made by humans but rarely
by benchmarking programs.

> -It cannot run at maximum possible speed all the time.  It needs to be the
> case that the system keeps up with the load during the rest of the time, but
> the sync phase of checkpoints causes I/O to queue faster than it's draining,
> thus saturating all caches and then blocking backends.  Ideally, "Dirty:" in
> /proc/meminfo will reach >90% of the dirty_background_ratio trigger line
> around the same time the sync phase starts.
>
> -There should be a lot of clients doing a mix of work.  The way Linux I/O
> works, the scheduling for readers vs. writers is complicated, and this is
> one of the few areas where things like CFQ vs. Deadline matter.

OK, my idea was just to dial back pgbench's -c until it fits the
criteria of your previous paragraph, but I guess dialing it back would
change the kernel's perception of scheduling priorities. Do you use
connection poolers?

> I've realized now one reason I never got anywhere with this while running
> pgbench tests is that pgbench always runs at 100% of capacity.  It fills all
> the caches involved completely as fast as it can, and every checkpoint
> starts with them already filled to capacity.  So when latency gets bad at
> checkpoint time, no amount of clever reordering will help keep those writes
> from interfering with other processes.  There just isn't any room to work
> with left.

But it only does that if you tell it to. If you just keep cranking up
-c until TPS flat-lines or decreases, then of course you will be in
that situation. You can insert sleeps, or just run fewer connections,
assuming that doesn't run afoul of the I/O scheduler.
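
By "insert sleeps" I mean something like this untested sketch -- the
file name and the 200 ms think time are just placeholders to dial the
steady-state load down below 100% of I/O capacity, and the aid range
assumes scale factor 1:

cat > write_think.sql <<'EOF'
\setrandom aid 1 100000
\setrandom delta -5000 5000
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
INSERT INTO pgbench_history (aid, delta, mtime) VALUES (:aid, :delta, now());
END;
\sleep 200 ms
EOF
pgbench -n -c 32 -T 1800 -l -f write_think.sql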

The imbalance between selects versus updates/inserts is harder to
correct for. I've tried to model some things just by starting a
pgbench -S and a default pgbench with a strategic ratio of clients for
each, but it would be nice if this were easier/more automatic.
Another nice feature would be if you could define a weight with -f.
If I want one transaction file to execute 50 times more often than
another, it's not so nice to have to specify 51 -f switches with 50 of
them all being the same file.
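
The workaround today is something like the below, where the two file
names are just placeholders for whatever the transaction profiles are:

files=""
for i in $(seq 1 50); do files="$files -f select_only.sql"; done
pgbench -n -c 16 -T 600 $files -f update_heavy.sql

It works, but a weight option would be much nicer.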

What problems do you see with pgbench? Can you not reproduce
something similar to the production latency problems, or can you
reproduce them, but things that fix the problem in pgbench don't
translate to production? Or the other way around, things that work in
production didn't work in pgbench?

> What I think is needed instead is a write-heavy benchmark with a think time
> in it, so that we can dial the workload up to, say, 90% of I/O capacity, but
> that spikes to 100% when checkpoint sync happens.  Then rearrangements in
> syncing that reduces caching pressure should be visible as a latency
> reduction in client response times.  My guess is that dbt-2 can be
> configured to provide such a workload, and I don't see a way forward here
> except for me to fully embrace that and start over with it.

But I would think that pgbench can be configured to do that as well,
and would probably offer a wider array of other testers. Of course, if
they have to copy and specify 30 different -f files, maybe getting
dbt-2 to install and run would be easier than that. My attempts at
getting dbt-5 to work for me do not make me eager to jump from pgbench
to try other things.

...

> [1] Where I think I'm going to end up with this eventually now is that
> setting checkpoint_sync_pause is the important tunable.  The parameter that
> then gets auto-tuned is checkpoint_timeout.  If you have 300 relations to
> sync and you have to wait 10 seconds between syncs to get latency down, the
> server is going to inform you an hour between checkpoints is all you can do
> here.

Do we have a theoretical guess about how fast you should be able to
go, based on the RAID capacity and the speed and density at which you
dirty data?
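
Even a crude back-of-envelope would help -- something like this, where
the sustained random-write rate of the BBWC-backed array is a number
I'm just making up for illustration:

dirty_mb=3072                      # ~3GB dirtied per checkpoint, from your description
pages=$(( dirty_mb * 1024 / 8 ))   # 8KB pages -> ~393k mostly-random writes
iops=2000                          # assumed sustained random 8KB write rate of the array
echo "$(( pages / iops )) seconds of pure random-write work"   # ~196 seconds

If the real number is anywhere near that, it at least bounds how short
the sync phase could ever be.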

If you just keep spreading out the syncs until a certain latency is
reached, then you are continuing to dirty data at a prodigious rate
between those syncs. I think it is quite likely you will hit a vicious
circle, where one time the server informs you that one-hour checkpoints
are all you can do, then an hour later tells you that 5-hour
checkpoints are all you can do, and then 5 hours later it tells you
that a week is really the best it can do.

At some point you need to buy more RAID controllers. Any way to know
how close to that point you are?

Cheers,

Jeff
