On Wed, Oct 27, 2010 at 6:55 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Oct 27, 2010 at 12:41 AM, Rob Wultsch <wultsch(at)gmail(dot)com> wrote:
>> On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch <wultsch(at)gmail(dot)com> wrote:
>>>> The double write buffer is one of the few areas where InnoDB does more
>>>> IO (in the form of fsynch's) than PG. InnoDB also has fuzzy
>>>> checkpoints (which help to keep dirty pages in memory longer),
>>>> buffering of writing out changes to secondary indexes, and recently
>>>> tunable page level compression.
>>> Baron Schwartz was talking to me about this at Surge. I don't really
>>> understand how the fuzzy checkpoint stuff works, and I haven't been
>>> able to find a good description of it anywhere. How does it keep
>>> dirty pages in memory longer? Details on the other things you mention
>>> would be interesting to hear, too.
>> For checkpoint behavior:
>> I would think that best case behavior "sharp" checkpoints with a large
>> checkpoint_completion_target would have behavior similar to a fuzzy
> Well, under that definition of a fuzzy checkpoint, our checkpoints are
> fuzzy even with checkpoint_completion_target=0.
> What Baron seemed to be describing was a scheme whereby you could do
> what I might call partial checkpoints. IOW, you want to move the redo
> pointer without writing out ALL the dirty buffers in memory, so you
> write out the pages with the oldest LSNs and then move the redo
> pointer to the oldest LSN you have left. Except that doesn't quite
> work, because the page might have been dirtied at LSN X and then later
> updated again at LSN Y, and you still have to flush it to disk before
> moving the redo pointer to any value >X. So you work around that by
> maintaining a "first dirtied" LSN for each page as well as the current
> I'm not 100% sure that this is how it works or that it would work in
> PG, but even assuming that it is and does, I'm not sure what the
> benefit is over the checkpoint-spreading logic we have now. There
> might be some benefit in sorting the writes that we do, so that we can
> spread out the fsyncs. So, write all the blocks to a give file,
> fsync, and then repeat for each underlying data file that has at least
> one dirty block. But that's completely orthogonal to (and would
> actually be hindered by) the approach described in the preceding
I wish I could answer your questions better. I am a power user that
does not fully understand InnoDB internals. There are not all that
many folks that have a very good understanding of InnoDB internals
(given how well it works there is not all that much need).
>> Insert (for innodb 1.1+ evidently there is also does delete and purge)
> We do something a bit like this for GIST indices. It would be
> interesting to see if it also has a benefit for btree indices.
>> For a recent ~800GB db I had to restore, the insert buffer saved 92%
>> of io needed for secondary indexes.
>> For many workloads 50% compression results in negligible impact to
>> performance. For certain workloads compression can help performance.
>> Please note that InnoDB also has non-tunable toast like feature.
> Interesting. I am surprised this works well. It seems that this only
> works for pages that can be compressed by >=50%, which seems like it
> could result in a lot of CPU wasted on failed attempts to compress.
In my world, the spinning disk is almost always the bottleneck.
Trading CPU for IO is almost always a good deal for me.
>>>> Given that InnoDB is not shipping its logs across the wire, I don't
>>>> think many users would really care if it used the double writer or
>>>> full page writes approach to the redo log (other than the fact that
>>>> the log files would be bigger). PG on the other hand *is* pushing its
>>>> logs over the wire...
>>> So how is InnoDB doing replication? Is there a second log just for that?
>> The other log is the "binary log" and it is one of the biggest
>> problems with MySQL. Running MySQL in such a way that the binary log
>> stays in sync with the InnoDB redo has a very significant impact on
>> (check out the pretty graph)
> Hmm. That seems kinda painful. Having to ship full page images over
> the wire doesn't seems so bad by comparison, though I'm not very happy
> about having to do that either.
The binary log is less than ideal, but with MySQL replication I can
replicate to *many* servers that are *very* geographically distributed
without all that many headaches. In addition it is simple enough that
I can have junior DBA manage it. I have doubts that I could make PG
do the same anywhere near as easily, particularly given how long and
narrow some pipes are...
In response to
pgsql-performance by date
|Next:||From: Alex Hunsaker||Date: 2010-10-28 04:23:44|
|Subject: Re: Postgres insert performance and storage requirement
compared to Oracle|
|Previous:||From: Tom Lane||Date: 2010-10-28 03:32:25|
|Subject: Re: Re: Postgres insert performance and storage requirement compared to Oracle |
pgsql-www by date
|Next:||From: James Mansion||Date: 2010-10-28 20:33:19|
|Subject: Re: BBU Cache vs. spindles|
|Previous:||From: Robert Haas||Date: 2010-10-28 01:55:06|
|Subject: Re: BBU Cache vs. spindles|