Re: 16-bit page checksums for 9.2

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: simon(at)2ndquadrant(dot)com, ants(dot)aasma(at)eesti(dot)ee, heikki(dot)linnakangas(at)enterprisedb(dot)com, jeff(dot)janes(at)gmail(dot)com, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-04 20:04:48
Message-ID: CA+TgmoY+QQSSF19K10VcYVUfBJaBKNdJKaw6wbt7o38=d2X=ew@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 4, 2012 at 1:32 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> we only fsync() at end-of-checkpoint.  So we'd have to think about
>> what to fsync, and how often, to keep the double-write buffer to a
>> manageable size.
>
> I think this is the big tuning challenge with this technology.

One of them, anyway. I think it may also be tricky to make sure that
a backend that needs to write a dirty buffer doesn't end up having to
wait for a double-write to be fsync'd.

>> I can't help thinking that any extra fsyncs are pretty expensive,
>> though, especially if you have to fsync() every file that's been
>> double-written before clearing the buffer. Possibly we could have
>> 2^N separate buffers based on an N-bit hash of the relfilenode and
>> segment number, so that we could just fsync 1/(2^N)-th of the open
>> files at a time.
>
> I'm not sure I'm following -- we would just be fsyncing those files
> we actually wrote pages into, right?  Not all segments for the table
> involved?

Yes.
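
To make the bucketing idea above a bit more concrete, here's a rough
sketch of the kind of hashing I have in mind -- none of this is real
code, and DW_HASH_BITS, DWBucket, and dw_bucket_for() are all made-up
names:

    #include <stdint.h>

    #define DW_HASH_BITS   3                    /* 2^N buckets; N = 3 here */
    #define DW_NUM_BUCKETS (1 << DW_HASH_BITS)

    /* hypothetical per-bucket double-write buffer */
    typedef struct DWBucket
    {
        int         npages;     /* pages currently queued in this bucket */
        /* queued page images and their target files would live here */
    } DWBucket;

    static DWBucket dw_buckets[DW_NUM_BUCKETS];

    /*
     * Map (relfilenode, segment number) to one of the 2^N buckets, so
     * that flushing a single bucket only requires fsyncing the files
     * whose pages hashed into it -- roughly 1/(2^N) of the files we've
     * double-written.
     */
    static inline int
    dw_bucket_for(uint32_t relfilenode, uint32_t segno)
    {
        uint32_t    h = relfilenode ^ (segno * 2654435761u);    /* cheap mix */

        return h & (DW_NUM_BUCKETS - 1);
    }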

>> But even that sounds expensive: writing back lots of dirty data
>> isn't cheap.  One of the systems I've been doing performance
>> testing on can sometimes take >15 seconds to write a shutdown
>> checkpoint,
>
> Consider the relation-file fsyncs for double-write as a form of
> checkpoint spreading, and maybe it won't seem so bad.  It should
> make that shutdown checkpoint less painful.  Now, I have been
> thinking that on a write-heavy system you had better have a BBU
> write-back cache, but that's my recommendation, anyway.

I think this point has possibly been beaten to death, but at the risk
of belaboring the point I'll bring it up again: the frequency with
which we fsync() is basically a trade-off between latency and
throughput. If you fsync a lot, then each one will be small, so you
shouldn't experience much latency, but throughput might suck. If you
don't fsync very much, then you maximize the chances for
write-combining (because inserting an fsync between two writes to the
same block forces that block to be physically written twice rather
than just once) thus improving throughput, but when you do get around
to calling fsync() there may be a lot of data to write all at once,
and you may get a gigantic latency spike.
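
Just to spell the write-combining point out, here's a trivial
illustration (plain C, nothing PostgreSQL-specific; the file descriptor
and the 8K page size are only for the example).  Without the
intervening fsync(), the kernel is free to coalesce the two writes to
block 0 into one physical write; with it, the block has to hit the disk
twice:

    #include <stdbool.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGESZ 8192

    static void
    write_block_zero_twice(int fd, bool fsync_between)
    {
        char        page[PAGESZ];

        memset(page, 0xAA, sizeof(page));
        if (pwrite(fd, page, sizeof(page), 0) < 0)
            return;                             /* error handling elided */

        if (fsync_between)
            fsync(fd);                          /* forces a physical write now */

        memset(page, 0xBB, sizeof(page));
        if (pwrite(fd, page, sizeof(page), 0) < 0)
            return;

        fsync(fd);                              /* and at least one more here */
    }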

As far as I can tell, one fsync per checkpoint is the theoretical
minimum, and that's what we do now. So our current system is
optimized for throughput. The decision to put full-page images into
WAL rather than a separate buffer is essentially turning the dial in
the same direction, because, in effect, the double-write fsync
piggybacks on the WAL fsync which we must do anyway. So both the
decision to use a double-write buffer AT ALL and the decision to fsync
more frequently to keep that buffer to a manageable size are going to
result in turning that dial in the opposite direction. It seems to me
inevitable that, even with the best possible implementation,
throughput will get worse. With a good implementation (though not with
a bad one), latency should improve.

Now, this is not necessarily a reason to reject the idea. I believe
that several people have proposed that our current implementation is
*overly* optimized for throughput *at the expense of* latency, and
that we might want to provide some options that, in one way or
another, fsync more frequently, so that checkpoint spikes aren't as
bad. But when it comes time to benchmark, we might need to think
somewhat carefully about what we're testing...

Another thought here is that double-writes may not be the best
solution, and are almost certainly not the easiest-to-implement
solution. We could instead do something like this: when an unlogged
change is made to a buffer (e.g. a hint bit is set), we set a flag on
the buffer header. When we evict such a buffer, we emit a WAL record
that just overwrites the whole buffer with a new FPI. There are some
pretty obvious usage patterns where this is likely to be painful (e.g.
load a big table without setting hint bits, and then seq-scan it).
But there are also many use cases where the working set fits inside
shared buffers and data pages don't get written very often, apart from
checkpoint time, and those cases might work just fine. Also, the
cases that are problems for this implementation are likely to also be
problems for a double-write based implementation, for exactly the same
reasons: if you discover at buffer eviction time that you need to
fsync something (whether it's WAL or DW), it's going to hurt.
Checksums aren't free even when using double writes: if you don't have
checksums, pages that have only hint-bit changes don't need to be
double-written. If double writes aren't going to give us anything
"for free", maybe that's not the right place to be focusing our
efforts...
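
For what it's worth, the eviction path I'm imagining would look very
roughly like this -- again just a sketch, with BM_HINT_DIRTY,
MarkBufferHintDirty(), and XLogWriteFullPageImage() all invented for
the sake of illustration:

    #include <stdint.h>

    #define BLCKSZ 8192

    /* hypothetical slice of a buffer header */
    typedef struct BufferDesc
    {
        uint32_t    flags;
        char        page[BLCKSZ];
    } BufferDesc;

    #define BM_HINT_DIRTY   (1 << 1)    /* unlogged change (e.g. hint bit) made */

    /* stand-in for emitting a full-page image into WAL; not a real function */
    static void
    XLogWriteFullPageImage(const char *page)
    {
        (void) page;
    }

    /* called wherever we currently set a hint bit without WAL-logging it */
    static void
    MarkBufferHintDirty(BufferDesc *buf)
    {
        buf->flags |= BM_HINT_DIRTY;
    }

    /*
     * At eviction time, if the buffer carries unlogged changes, emit a
     * WAL record that just overwrites the whole page with a new FPI
     * before the page (and its checksum) goes out to disk.
     */
    static void
    FlushBufferWithChecksum(BufferDesc *buf)
    {
        if (buf->flags & BM_HINT_DIRTY)
        {
            XLogWriteFullPageImage(buf->page);
            buf->flags &= ~BM_HINT_DIRTY;
        }
        /* ... then checksum the page and write it out as usual ... */
    }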

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
