On Wed, Oct 27, 2010 at 12:41 AM, Rob Wultsch <wultsch(at)gmail(dot)com> wrote:
> On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch <wultsch(at)gmail(dot)com> wrote:
>>> The double write buffer is one of the few areas where InnoDB does more
>>> IO (in the form of fsynch's) than PG. InnoDB also has fuzzy
>>> checkpoints (which help to keep dirty pages in memory longer),
>>> buffering of writing out changes to secondary indexes, and recently
>>> tunable page level compression.
>> Baron Schwartz was talking to me about this at Surge. I don't really
>> understand how the fuzzy checkpoint stuff works, and I haven't been
>> able to find a good description of it anywhere. How does it keep
>> dirty pages in memory longer? Details on the other things you mention
>> would be interesting to hear, too.
> For checkpoint behavior:
> I would think that best case behavior "sharp" checkpoints with a large
> checkpoint_completion_target would have behavior similar to a fuzzy
Well, under that definition of a fuzzy checkpoint, our checkpoints are
fuzzy even with checkpoint_completion_target=0.
What Baron seemed to be describing was a scheme whereby you could do
what I might call partial checkpoints. IOW, you want to move the redo
pointer without writing out ALL the dirty buffers in memory, so you
write out the pages with the oldest LSNs and then move the redo
pointer to the oldest LSN you have left. Except that doesn't quite
work, because the page might have been dirtied at LSN X and then later
updated again at LSN Y, and you still have to flush it to disk before
moving the redo pointer to any value >X. So you work around that by
maintaining a "first dirtied" LSN for each page as well as the current
I'm not 100% sure that this is how it works or that it would work in
PG, but even assuming that it is and does, I'm not sure what the
benefit is over the checkpoint-spreading logic we have now. There
might be some benefit in sorting the writes that we do, so that we can
spread out the fsyncs. So, write all the blocks to a give file,
fsync, and then repeat for each underlying data file that has at least
one dirty block. But that's completely orthogonal to (and would
actually be hindered by) the approach described in the preceding
> Insert (for innodb 1.1+ evidently there is also does delete and purge)
We do something a bit like this for GIST indices. It would be
interesting to see if it also has a benefit for btree indices.
> For a recent ~800GB db I had to restore, the insert buffer saved 92%
> of io needed for secondary indexes.
> For many workloads 50% compression results in negligible impact to
> performance. For certain workloads compression can help performance.
> Please note that InnoDB also has non-tunable toast like feature.
Interesting. I am surprised this works well. It seems that this only
works for pages that can be compressed by >=50%, which seems like it
could result in a lot of CPU wasted on failed attempts to compress.
>>> Given that InnoDB is not shipping its logs across the wire, I don't
>>> think many users would really care if it used the double writer or
>>> full page writes approach to the redo log (other than the fact that
>>> the log files would be bigger). PG on the other hand *is* pushing its
>>> logs over the wire...
>> So how is InnoDB doing replication? Is there a second log just for that?
> The other log is the "binary log" and it is one of the biggest
> problems with MySQL. Running MySQL in such a way that the binary log
> stays in sync with the InnoDB redo has a very significant impact on
> (check out the pretty graph)
Hmm. That seems kinda painful. Having to ship full page images over
the wire doesn't seems so bad by comparison, though I'm not very happy
about having to do that either.
The Enterprise PostgreSQL Company
In response to
pgsql-performance by date
|Next:||From: Robert Haas||Date: 2010-10-28 02:01:07|
|Subject: Re: Re: Postgres insert performance and storage requirement
compared to Oracle|
|Previous:||From: Craig Ringer||Date: 2010-10-27 23:24:33|
|Subject: Re: Massive update, memory usage|
pgsql-www by date
|Next:||From: Rob Wultsch||Date: 2010-10-28 03:43:52|
|Subject: Re: BBU Cache vs. spindles|
|Previous:||From: Rob Wultsch||Date: 2010-10-27 04:41:55|
|Subject: Re: BBU Cache vs. spindles|