Re: BBU Cache vs. spindles

From: Rob Wultsch <wultsch(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Greg Smith <greg(at)2ndquadrant(dot)com>, jd(at)commandprompt(dot)com, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Steve Crawford <scrawford(at)pinpointresearch(dot)com>, pgsql-performance(at)postgresql(dot)org, Ben Chobot <bench(at)silentmedia(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: BBU Cache vs. spindles
Date: 2010-10-28 03:43:52
Message-ID: AANLkTikusb=5Q5KLCQWE8ctfUNaO5kqJjdeVk1nz2xVX@mail.gmail.com
Lists: pgsql-performance pgsql-www

On Wed, Oct 27, 2010 at 6:55 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Oct 27, 2010 at 12:41 AM, Rob Wultsch <wultsch(at)gmail(dot)com> wrote:
>> On Tue, Oct 26, 2010 at 7:25 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Tue, Oct 26, 2010 at 10:13 AM, Rob Wultsch <wultsch(at)gmail(dot)com> wrote:
>>>> The doublewrite buffer is one of the few areas where InnoDB does more
>>>> IO (in the form of fsyncs) than PG. InnoDB also has fuzzy
>>>> checkpoints (which help to keep dirty pages in memory longer),
>>>> buffering of changes to secondary indexes, and, more recently,
>>>> tunable page-level compression.
>>>
>>> Baron Schwartz was talking to me about this at Surge.  I don't really
>>> understand how the fuzzy checkpoint stuff works, and I haven't been
>>> able to find a good description of it anywhere.  How does it keep
>>> dirty pages in memory longer?  Details on the other things you mention
>>> would be interesting to hear, too.
>>
>> For checkpoint behavior:
>> http://books.google.com/books?id=S_yHERPRZScC&pg=PA606&lpg=PA606&dq=fuzzy+checkpoint&source=bl&ots=JJrzRUKBGh&sig=UOMPsRy5E-YDgjAFkaSVn3dps_M&hl=en&ei=_k8yTOfeHYzZnAepyumLBA&sa=X&oi=book_result&ct=result&resnum=8&ved=0CEYQ6AEwBw#v=onepage&q=fuzzy%20checkpoint&f=false
>>
>> I would think that, in the best case, "sharp" checkpoints with a large
>> checkpoint_completion_target would behave similarly to a fuzzy
>> checkpoint.
>
> Well, under that definition of a fuzzy checkpoint, our checkpoints are
> fuzzy even with checkpoint_completion_target=0.
>
> What Baron seemed to be describing was a scheme whereby you could do
> what I might call partial checkpoints.  IOW, you want to move the redo
> pointer without writing out ALL the dirty buffers in memory, so you
> write out the pages with the oldest LSNs and then move the redo
> pointer to the oldest LSN you have left.  Except that doesn't quite
> work, because the page might have been dirtied at LSN X and then later
> updated again at LSN Y, and you still have to flush it to disk before
> moving the redo pointer to any value >X.  So you work around that by
> maintaining a "first dirtied" LSN for each page as well as the current
> LSN.
>
> I'm not 100% sure that this is how it works or that it would work in
> PG, but even assuming that it is and does, I'm not sure what the
> benefit is over the checkpoint-spreading logic we have now.  There
> might be some benefit in sorting the writes that we do, so that we can
> spread out the fsyncs.  So, write all the blocks for a given file,
> fsync, and then repeat for each underlying data file that has at least
> one dirty block.  But that's completely orthogonal to (and would
> actually be hindered by) the approach described in the preceding
> paragraph.

I wish I could answer your questions better. I am a power user who does
not fully understand InnoDB internals. There are not all that many folks
who have a very good understanding of them (given how well InnoDB works,
there is not all that much need).
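
For what it is worth, my rough mental model of the scheme you describe
(including the per-file write sorting and single fsync) is something like
the toy sketch below. None of this is actual InnoDB or PG code, and every
structure and name in it is invented:

# Toy sketch of the "partial checkpoint" idea described above: each dirty
# buffer remembers the LSN at which it was *first* dirtied, we flush the
# buffers with the oldest first-dirtied LSNs (grouped by file so each file
# gets a single fsync), and then advance the redo pointer to the oldest
# first-dirtied LSN still left in memory.
import os
from collections import defaultdict

BLOCK_SIZE = 8192  # assumed page size for the sketch

class DirtyBuffer:
    def __init__(self, file_path, block_no, first_dirty_lsn, data):
        self.file_path = file_path              # data file the block belongs to
        self.block_no = block_no
        self.first_dirty_lsn = first_dirty_lsn  # LSN when the page first became dirty
        self.data = data                        # BLOCK_SIZE bytes

def partial_checkpoint(dirty_buffers, fraction=0.25):
    """Flush the oldest `fraction` of the dirty buffers and return the
    new redo pointer plus the buffers that are still dirty."""
    ordered = sorted(dirty_buffers, key=lambda b: b.first_dirty_lsn)
    cutoff = max(1, int(len(ordered) * fraction))
    to_flush, remaining = ordered[:cutoff], ordered[cutoff:]

    # Group the writes by file so each underlying data file gets one fsync.
    by_file = defaultdict(list)
    for buf in to_flush:
        by_file[buf.file_path].append(buf)

    for path, bufs in by_file.items():
        fd = os.open(path, os.O_WRONLY)
        try:
            for buf in sorted(bufs, key=lambda b: b.block_no):
                os.pwrite(fd, buf.data, buf.block_no * BLOCK_SIZE)
            os.fsync(fd)          # one fsync per file, not one per page
        finally:
            os.close(fd)

    # The redo pointer can only advance to the oldest *unflushed*
    # first-dirtied LSN: everything first dirtied before that point is
    # now safely on disk.
    new_redo_ptr = min((b.first_dirty_lsn for b in remaining), default=None)
    return new_redo_ptr, remaining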

>
>> Insert buffering (for InnoDB 1.1+ it evidently also buffers delete and
>> purge operations):
>> http://dev.mysql.com/doc/refman/5.5/en/innodb-insert-buffering.html
>
> We do something a bit like this for GIST indices.  It would be
> interesting to see if it also has a benefit for btree indices.
>
>> For a recent ~800GB db I had to restore, the insert buffer saved 92%
>> of the I/O needed for secondary indexes.
>>
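
Purely to illustrate where that saving comes from, here is a toy sketch of
how I picture that kind of change buffering (again, not actual InnoDB code;
all names invented). Changes aimed at index pages that are not in memory get
queued and merged in bulk later, so a bulk load or restore avoids a random
read-modify-write per index entry:

# Toy sketch of secondary-index change buffering.  Instead of touching an
# index page that is not in the buffer pool, the change is queued; queued
# changes are merged when the page is eventually read anyway, turning many
# random index-page I/Os into far fewer.
from collections import defaultdict

class ChangeBufferedIndex:
    def __init__(self, buffer_pool, storage):
        self.buffer_pool = buffer_pool      # page_no -> in-memory page (dict of lists)
        self.storage = storage              # page_no -> on-disk page (dict of lists)
        self.pending = defaultdict(list)    # page_no -> queued index entries

    def insert(self, page_no, entry):
        if page_no in self.buffer_pool:
            # Page already cached: apply the change directly, no extra I/O.
            self.buffer_pool[page_no].append(entry)
        else:
            # Page not cached: buffer the change instead of reading the page.
            self.pending[page_no].append(entry)

    def read_page(self, page_no):
        if page_no not in self.buffer_pool:
            self.buffer_pool[page_no] = list(self.storage.get(page_no, []))
        # Merge any buffered changes now that the page is in memory anyway.
        if page_no in self.pending:
            self.buffer_pool[page_no].extend(self.pending.pop(page_no))
        return self.buffer_pool[page_no]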
>> Compression:
>> http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-compression-internals.html
>>
>> For many workloads, 50% compression has a negligible impact on
>> performance. For certain workloads compression can actually help
>> performance. Please note that InnoDB also has a non-tunable TOAST-like
>> feature.
>
> Interesting.  I am surprised this works well.  It seems that this only
> works for pages that can be compressed by >=50%, which could result in
> a lot of CPU wasted on failed attempts to compress.

In my world, the spinning disk is almost always the bottleneck.
Trading CPU for IO is almost always a good deal for me.
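
To put that trade in concrete terms, here is a toy sketch of the
compress-or-fall-back decision. It is not the actual InnoDB algorithm
(which splits and reorganizes pages on a failed attempt, glossed over
here); the point is simply that a failed attempt costs only CPU, while a
successful one halves the bytes that have to hit the spindle:

import zlib

PAGE_SIZE = 16384             # assumed uncompressed page size
TARGET_SIZE = PAGE_SIZE // 2  # page must compress to <= 50% to be stored compressed

def store_page(page_bytes):
    """Return (is_compressed, payload) for a single page."""
    compressed = zlib.compress(page_bytes)   # CPU spent whether or not it pays off
    if len(compressed) <= TARGET_SIZE:
        return True, compressed              # half the bytes -> half the I/O
    return False, page_bytes                 # failed attempt: CPU wasted, I/O unchanged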

>
>>>> Given that InnoDB is not shipping its logs across the wire, I don't
>>>> think many users would really care if it used the doublewrite or
>>>> full-page-writes approach to the redo log (other than the fact that
>>>> the log files would be bigger). PG, on the other hand, *is* pushing its
>>>> logs over the wire...
>>>
>>> So how is InnoDB doing replication?  Is there a second log just for that?
>>>
>>
>> The other log is the "binary log", and it is one of the biggest
>> problems with MySQL. Running MySQL in such a way that the binary log
>> stays in sync with the InnoDB redo log has a very significant impact on
>> performance.
>> http://www.mysqlperformanceblog.com/2010/10/23/mysql-limitations-part-2-the-binary-log/
>> http://mysqlha.blogspot.com/2010/09/mysql-versus-mongodb-update-performance.html
>> (check out the pretty graph)
>
> Hmm.  That seems kinda painful.  Having to ship full page images over
> the wire doesn't seem so bad by comparison, though I'm not very happy
> about having to do that either.
>

The binary log is less than ideal, but with MySQL replication I can
replicate to *many* servers that are *very* geographically distributed
without all that many headaches. In addition, it is simple enough that
I can have a junior DBA manage it. I have doubts that I could make PG
do the same anywhere near as easily, particularly given how long and
do the same anywhere near as easily, particularly given how long and
narrow some pipes are...
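
To be concrete about what "in sync" means here (these are just the standard
durability knobs, not a tuning recommendation): both logs have to be fsynced
at commit, which is where the pain comes from. Roughly:

# my.cnf fragment: keeping the binary log and the InnoDB redo log durably
# in sync.  Both settings force an fsync at every commit.
[mysqld]
log_bin                        = mysql-bin
sync_binlog                    = 1   # fsync the binary log at every commit
innodb_flush_log_at_trx_commit = 1   # fsync the InnoDB redo log at every commit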

--
Rob Wultsch
wultsch(at)gmail(dot)com
