Re: Checkpoint sync pause

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Checkpoint sync pause
Date: 2012-02-07 21:22:11
Message-ID: 4F319603.6090501@gregsmith.com
Lists: pgsql-hackers

On 02/03/2012 11:41 PM, Jeff Janes wrote:
>> -The steady stream of backend writes that happen between checkpoints has
>> filled up most of the OS write cache. A look at /proc/meminfo shows around
>> 2.5GB "Dirty:"
> "backend writes" includes bgwriter writes, right?

Right.

> Has using a newer kernel with dirty_background_bytes been tried, so it
> could be set to a lower level? If so, how did it do? Or does it just
> refuse to obey below the 5% level, as well?

Trying to dip below 5% using dirty_background_bytes slows VACUUM down
faster than it improves checkpoint latency. Since the sort of servers
that have checkpoint issues are quite often ones that have VACUUM issues,
too, that whole path doesn't seem very productive. The one test I
haven't tried yet is whether increasing the size of the VACUUM ring
buffer might improve how well the server responds to a lower write cache.
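
For anyone who wants to watch this while testing, here's the sort of
throwaway monitoring script I use; a minimal sketch assuming Linux procfs
paths and Python, with an arbitrary 60 second sample interval:

#!/usr/bin/env python
# Sample the kernel write-cache state relevant to checkpoint stalls.
import time

def read_meminfo(fields=("Dirty", "Writeback")):
    # /proc/meminfo reports these values in kB
    out = {}
    with open("/proc/meminfo") as f:
        for line in f:
            name, rest = line.split(":", 1)
            if name in fields:
                out[name] = int(rest.strip().split()[0])
    return out

def read_sysctl(name):
    # dirty_background_bytes only exists on newer kernels; None if absent
    try:
        with open("/proc/sys/vm/" + name) as f:
            return int(f.read().strip())
    except IOError:
        return None

while True:
    info = read_meminfo()
    print("Dirty=%skB Writeback=%skB background_ratio=%s background_bytes=%s"
          % (info.get("Dirty", 0), info.get("Writeback", 0),
             read_sysctl("dirty_background_ratio"),
             read_sysctl("dirty_background_bytes")))
    time.sleep(60)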

> If there is 3GB of dirty data spread over >300 segments, and each segment
> is about full-sized (1GB), then on average <1% of each segment is
> dirty?
>
> If that analysis holds, then it seems like there is simply an awful lot
> of data that has to be written randomly, no matter how clever the
> re-ordering is. In other words, it is not that a harried or panicked
> kernel or RAID controller is failing to do good re-ordering when it has
> opportunities to, it is just that you dirty your data too randomly for
> substantial reordering to be possible even under ideal conditions.

Averages are deceptive here. This data follows the usual distribution
for real-world data, which is that there is a hot chunk of data that
receives far more writes than average (particularly index blocks), along
with a long tail of segments that are only seeing one or two 8K blocks
modified (catalog data, stats, application metadata).
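
A toy example of how the average hides the skew; the numbers here are
invented to mirror that shape, not measured from the production server:

# Toy illustration: a few hot 1GB segments plus a long tail of barely
# touched ones can average out to "<1% dirty per segment" while still
# leaving plenty of reorderable writes in the hot segments.
BLOCKS_PER_SEG = 131072                # 1GB of 8K blocks

hot = [30000] * 10                     # 10 hot segments, ~23% dirty each
tail = [2] * 290                       # 290 segments with a block or two dirty
dirty_per_seg = hot + tail

total_dirty = sum(dirty_per_seg)
avg_fraction = total_dirty / float(len(dirty_per_seg) * BLOCKS_PER_SEG)
print("segments=%d dirty=%.1fGB average dirty fraction=%.2f%%"
      % (len(dirty_per_seg), total_dirty * 8192 / 2.0 ** 30,
         100 * avg_fraction))
# The average looks tiny, but ~99% of the dirty data sits in the 10 hot
# segments, where elevator sorting still buys a lot.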

Plenty of useful reordering happens here. It's happening in Linux's
cache and in the controller's cache. The constant stream of
checkpoint syncs doesn't stop that. It does seem to do two bad things
though: a) it makes some of these bad cache-filled situations more likely,
and b) it doesn't leave any I/O capacity unused for clients to get some
work done. One of the real possibilities I've been considering more
lately is that the value we've seen from the pauses during sync isn't so
much about optimizing I/O; instead it comes from allowing a brief
window of client backend I/O to slip in there between the cache-filling
checkpoint syncs.
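
To be concrete about what the pause is doing, here's the rough shape of the
idea as a sketch; this is an illustration in Python rather than the actual
patch, and the file list and pause length are placeholders:

import os, time

SYNC_PAUSE_SECONDS = 3          # placeholder for the pause setting

def sync_with_pauses(pending_files):
    # fsync each file the checkpoint touched, one at a time, sleeping in
    # between so client backend I/O gets a window while the caches drain
    for i, path in enumerate(pending_files, 1):
        start = time.time()
        fd = os.open(path, os.O_RDWR)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
        print("Sync #%d time=%f msec" % (i, (time.time() - start) * 1000))
        time.sleep(SYNC_PAUSE_SECONDS)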

> Does the BBWC, once given an fsync command and reporting success,
> write out those blocks forthwith, or does it lolly-gag around like the
> kernel (under non-fsync) does? If it is waiting around for
> write-combining opportunities that will never actually materialize in
> sufficient quantities to make up for the wait, how to get it to stop?
>
> Was the sorted checkpoint with an fsync after every file (real file,
> not VFD) one of the changes you tried?

As far as I know the typical BBWC is always working. When a read or a
write comes in, it starts moving immediately. When it gets behind, it
starts making seek decisions more intelligently based on visibility of
the whole queue. But they don't delay doing any work at all the way
Linux does.

I haven't had very good luck with sorting checkpoints at the PostgreSQL
relation level on server-size systems. There is a lot of sorting
already happening at both the OS (~3GB) and BBWC (>=512MB) levels on
this server. My own tests on my smaller test server--with a scaled down
OS (~750MB) and BBWC (256MB) cache--haven't ever validated sorting as a
useful technique on top of that. It's never bubbled up to being
considered a likely win on the production one as a result.

>> DEBUG: Sync #1 time=21.969000 gap=0.000000 msec
>> DEBUG: Sync #2 time=40.378000 gap=0.000000 msec
>> DEBUG: Sync #3 time=12574.224000 gap=3007.614000 msec
>> DEBUG: Sync #4 time=91.385000 gap=2433.719000 msec
>> DEBUG: Sync #5 time=2119.122000 gap=2836.741000 msec
>> DEBUG: Sync #6 time=67.134000 gap=2840.791000 msec
>> DEBUG: Sync #7 time=62.005000 gap=3004.823000 msec
>> DEBUG: Sync #8 time=0.004000 gap=2818.031000 msec
>> DEBUG: Sync #9 time=0.006000 gap=3012.026000 msec
>> DEBUG: Sync #10 time=302.750000 gap=3003.958000 msec
> Syncs 3 and 5 kind of surprise me. It seems like the times should be
> more bimodal. If the cache is already full, why doesn't the system
> promptly collapse, like it does later? And if it is not, why would it
> take 12 seconds, or even 2 seconds? Or is this just evidence that the
> gaps you are inserting are partially, but not completely, effective?

Given a mix of completely random I/O, a 24-disk array like the one in this
system is lucky to hit 20MB/s clearing it out. It doesn't take too much of
that before even 12 seconds makes sense. The 45 second pauses are the
ones demonstrating the controller's cache is completely overwhelmed.
It's rare to see caching turn truly bimodal, because the model for it
has both a variable ingress and egress rate involved. Even as the
checkpoint sync is pushing stuff in, writes are simultaneously being
evacuated at some speed out the other end.
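
A back-of-the-envelope way to see why, with rough guesses for the cache size
and drain rate rather than measurements:

# The controller cache has writes going in (checkpoint syncs) and coming
# out (random writes hitting the disks) at the same time.  Numbers here
# are rough guesses for illustration only.
CACHE_MB = 512              # BBWC size
DRAIN_MB_PER_SEC = 20.0     # what 24 disks manage on fully random writes

def sync_stall_seconds(cache_fill_mb, sync_mb):
    # seconds an fsync blocks when the cache can't absorb this sync's data
    overflow = cache_fill_mb + sync_mb - CACHE_MB
    return max(overflow, 0) / DRAIN_MB_PER_SEC

print(sync_stall_seconds(cache_fill_mb=500, sync_mb=250))   # ~11.9s
print(sync_stall_seconds(cache_fill_mb=512, sync_mb=900))   # ~45s, overwhelmed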

> What I/O are they trying to do? It seems like all your data is in RAM
> (if not, I'm surprised you can get queries to run fast enough to
> create this much dirty data). So they probably aren't blocking on
> reads which are being interfered with by all the attempted writes.

Reads on infrequently read data. Long tail again; even though caching
is close to 100%, the occasional outlier client who wants some rarely
accessed page with their personal data on it shows up. Pollute the
write caches badly enough, and what happens to reads mixed in there
gets very fuzzy. It depends on the exact mechanics of the I/O scheduler
used in the kernel version deployed.

> The current shared_buffer allocation method (or my misunderstanding of
> it) reminds me of the joke about the guy who walks into his kitchen
> with a cow-pie in his hand and tells his wife "Look what I almost
> stepped in". If you find a buffer that is usagecount=0 and unpinned,
> but dirty, then why is it dirty? It is likely to be dirty because the
> background writer can't keep up. And if the background writer can't
> keep up, it is probably having trouble with writes blocking. So, for
> Pete's sake, don't try to write it out yourself! If you can't find a
> clean, reusable buffer in a reasonable number of attempts, I guess at
> some point you need to punt and write one out. But currently it grabs
> the first unpinned usagecount=0 buffer it sees and writes it out if
> dirty, without even checking if the next one might be clean.

Don't forget that in the version deployed here, the background writer
isn't running during the sync phase. I think the direction you're
talking about here circles back to "why doesn't the BGW just put things
it finds clean onto the free list?", a direction which would make
"nothing on the free list" a noteworthy event suggesting the BGW needs
to run more often.
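
For discussion's sake, the allocation behavior you're describing might look
something like this sketch; the buffer and clock objects are hypothetical
stand-ins, not the real bufmgr code:

def get_victim_buffer(pool, clock, max_dirty_skips=16):
    # Prefer a clean, unpinned, usage_count==0 buffer; only fall back to
    # writing out a dirty one after skipping a bounded number of them.
    fallback = None
    skips = 0
    while True:
        buf = pool[clock.next()]        # advance the clock sweep
        if buf.pinned:
            continue
        if buf.usage_count > 0:
            buf.usage_count -= 1
            continue
        if not buf.dirty:
            return buf                  # ideal case: clean and reusable
        if fallback is None:
            fallback = buf              # remember one we could write out
        skips += 1
        if skips >= max_dirty_skips:
            fallback.write_out()        # punt: the backend does the write
            return fallback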

> One option for pgbench I've contemplated was better latency reporting.
> I don't really want to have to mine very large log files (and just
> writing them out can produce IO that competes with the IO you actually
> care about, if you don't have a lot of controllers around to isolate
> everything).

Every time I've measured this, I've found it to be <1% of the total
I/O. The single line of data with latency counts, written buffered, is
pretty slim compared with the >=8K any write transaction is likely to
have touched. The only time I've found the disk writing overhead
becoming serious on an absolute scale is when I'm running read-only
in-memory benchmarks, where the rate might hit >100K TPS. But by
definition, that sort of test has I/O bandwidth to spare, so there it
doesn't actually impact results much. Just a fraction of a core doing
some sequential writes.
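
The rough arithmetic behind that claim, with the log line size being an
estimate:

log_line_bytes = 40          # one buffered latency record per transaction
data_written_bytes = 8192    # a write transaction dirties at least one 8K block
print("logging overhead ~%.2f%% of write volume"
      % (100.0 * log_line_bytes / data_written_bytes))   # ~0.49%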

> Also, what limits the amount of work that needs to get done? If you
> make a change that decreases throughput but also decreases latency,
> then something else has got to give.

The thing that is giving way here is total time taken to execute the
checkpoint. There's even a theoretical gain possible from that. It's
possible to prove (using the pg_stat_bgwriter counts) that having
checkpoints less frequently decreases total I/O, because there are fewer
writes of the most popular blocks happening. Right now, when I tune
that to decrease total I/O, the upper limit is when it starts spiking up
latency. This new GUC is trying to allow a different way to increase
checkpoint time that seems to do less of that.
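
The measurement itself is simple; here's a sketch of how I sample it,
assuming psycopg2 and a local connection (adjust the DSN for your setup):

import psycopg2

QUERY = """SELECT checkpoints_timed, checkpoints_req,
                  buffers_checkpoint, buffers_clean, buffers_backend
           FROM pg_stat_bgwriter"""

def snapshot(dsn="dbname=postgres"):
    # grab one reading of the cumulative bgwriter/checkpoint counters
    conn = psycopg2.connect(dsn)
    try:
        cur = conn.cursor()
        cur.execute(QUERY)
        cols = [d[0] for d in cur.description]
        return dict(zip(cols, cur.fetchone()))
    finally:
        conn.close()

def compare(before, after):
    # difference two snapshots taken around a test run
    delta = dict((k, after[k] - before[k]) for k in before)
    written = (delta["buffers_checkpoint"] + delta["buffers_clean"]
               + delta["buffers_backend"])
    print("checkpoints=%d buffers written=%d (%.1f MB)"
          % (delta["checkpoints_timed"] + delta["checkpoints_req"],
             written, written * 8192 / 1048576.0))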

> What problems do you see with pgbench? Can you not reproduce
> something similar to the production latency problems, or can you
> reproduce them, but things that fix the problem in pgbench don't
> translate to production? Or the other way around, things that work in
> production didn't work in pgbench?

I can't simulate something similar enough to the production latency
problem. Your comments about doing something like specifying 50 "-f"
files or a weighting are in the right area; it might be possible to hack
a better simulation with an approach like that. The thing that makes
wandering that way even harder than it seems at first is how we split
the pgbench work among multiple worker threads.
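
One cheap way to fake a weighted mix with today's pgbench is to repeat -f
arguments in proportion to the weights you want, since each supplied script
is picked with equal probability per transaction. A sketch that builds the
command line, with made-up script names:

weights = {"hot_update.sql": 45, "long_tail_select.sql": 4,
           "insert_history.sql": 1}

args = []
for script, weight in weights.items():
    args.extend(["-f", script] * weight)   # repeat to approximate the weight

print("pgbench -c 32 -j 4 -T 600 " + " ".join(args) + " pgbench")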

I'm not using connection pooling for the pgbench simulations I'm doing.
There's some of that happening in the production application server.

> But I would think that pgbench can be configured to do that as well,
> and would probably offer a wider array of other testers. Of course, if
> they have to copy and specify 30 different -f files, maybe getting
> dbt-2 to install and run would be easier than that. My attempts at
> getting dbt-5 to work for me do not make me eager to jump from pgbench
> to try other things.

dbt-5 is a work in progress, known to be tricky to get going. dbt-2 is
mature enough that it was used for this sort of role in 8.3
development. And it's even used by other database systems for similar
testing. It's the closest thing to an open-source standard for
write-heavy workloads that we'll find here.

What I'm doing right now is recording a large amount of pgbench data for
my test system here, to validate it has the typical problems pgbench
runs into. Once that's done I expect to switch to dbt-2 and see whether
it's a more useful latency test environment. That plan is working out
fine so far; it just hit a couple of weeks of unanticipated delay.

> Do we have a theoretical guess about how fast you should be able to
> go, based on the RAID capacity and the speed and density at which you
> dirty data?

This is a hard question to answer; it's something I've been thinking
about and modeling a lot lately. The problem is that the speed an array
writes at depends on how many reads or writes it does during each seek
and/or rotation. The array here can do 1GB/s of all sequential I/O, and
15 - 20MB/s on all random I/O. The more efficiently writes are
scheduled, the more like sequential I/O the workload becomes. Any
attempt to even try to estimate real-world throughput needs the number
of concurrent processes as another input, and the complexity of the
resulting model is high.
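
The crude model I keep coming back to looks like this; the locality knob is
a made-up parameter standing in for how well the writes get sorted, not
something directly measurable:

RANDOM_MBS = 17.5        # midpoint of the 15-20MB/s all-random figure
SEQ_MBS = 1000.0         # all-sequential rate for this array

def effective_mb_per_sec(locality):
    # locality=0 means fully random scheduling, locality=1 fully sequential
    return RANDOM_MBS + locality * (SEQ_MBS - RANDOM_MBS)

for loc in (0.0, 0.1, 0.5, 0.9):
    print("locality=%.1f -> ~%.0f MB/s" % (loc, effective_mb_per_sec(loc)))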

--
Greg Smith   2ndQuadrant US   greg(at)2ndQuadrant(dot)com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.com
