On Wed, Feb 1, 2012 at 9:47 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Jan 25, 2012 at 8:49 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Tue, Jan 24, 2012 at 4:28 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>> I think we should be working to commit XLogInsert and then Group
>>> Commit, then come back to the discussion.
>> I definitely agree that those two have way more promise than anything
>> else on the table. However, now that I understand how badly we're
>> getting screwed by checkpoints, I'm curious to do some more
>> investigation of what's going on there. I can't help thinking that in
>> these particular cases the full page writes may be a bigger issue than
>> the checkpoint itself. If that turns out to be the case it's not
>> likely to be something we're able to address for 9.2, but I'd like to
>> at least characterize it.
> I did another run with full_page_writes=off:
> This is run with the master branch as of commit
> 4f42b546fd87a80be30c53a0f2c897acb826ad52, same as the previous tests,
> so that the results are comparable.
> The graph is pretty dramatically different from the earlier one:
> So are the TPS numbers:
> full page writes on: tps = 2200.848350 (including connections establishing)
> full page writes off: tps = 9494.757916 (including connections establishing)
> It seems pretty clear that, even with full_page_writes=off, the
> checkpoint is affecting things negatively: the first 700 seconds or so
> show much better and more consistent performance than the remaining
> portion of the test. I'm not sure why that is, but I'm guessing there
> was a checkpoint around that time.
We really need to nail that down. Could you post the scripts (on the
wiki) you use for running the benchmark and making the graph? I'd
like to see how much work it would be for me to change it to detect
checkpoints and do something like color the markers blue during
checkpoints and red elsewhen.
Also, I'm not sure how bad that graph really is. The overall
throughput is more variable, and there are a few latency spikes but
they are few. The dominant feature is simply that the long-term
average is less than the initial burst. Of course the goal is to have
a high level of throughput with a smooth latency under sustained
conditions. But to expect that that long-sustained smooth level of
throughput be identical to the "initial burst throughput" sounds like
more of a fantasy than a goal. If we want to accept the lowered
throughput and work on the what variability/spikes are there, I think
a good approach would be to take the long term TPS average, and dial
the number of clients back until the initial burst TPS matches that
long term average. Then see if the spikes still exist over the long
term using that dialed back number of clients.
> But the effect is much worse with
> full_page_writes=on: the distinctive parabolic shape of those graphs
> is apparently caused by the gradually decreasing frequency of full
> page writes as the number of transactions processed since the last
> checkpoint grows.
The dilution out of not-yet-full-page-written buffers as the
checkpoint ages is probably the ultimate cause of that pattern, but I
suspect there are intermediate causes that we could tackle.
I don't think the full-page-writes are leading to WALInsert
contention, for example, because that would probably lead to smooth
throughput decline, but not those latency spikes in which those entire
seconds go by without transactions. I doubt it is leading to general
IO compaction, as the IO at that point should be pretty much
sequential (the checkpoint has not yet reached the sync stage, the WAL
is sequential). So I bet that that is caused by fsyncs occurring at
xlog segment switches, and the locking that that entails. If I
recall, we can have a segment which is completely written to OS and in
the process of being fsynced, and we can have another segment which is
in some state of partially in wal_buffers and partly written out to OS
cache, but that we can't start reusing the wal_buffers that were
already written to OS for that segment (and therefore are
theoretically available for reuse by the upcoming 3rd segment) until
the previous segments fsync has completed. So all WALInsert's freeze.
Or something like that. This code has changed a bit since last time
I studied it.
In response to
pgsql-hackers by date
|Next:||From: Kevin Grittner||Date: 2012-02-04 19:24:33|
|Subject: Re: [v9.2] Add GUC sepgsql.client_label|
|Previous:||From: Dimitri Fontaine||Date: 2012-02-04 19:07:05|
|Subject: Re: Finer Extension dependencies|