Re: Checkpoint cost, looks like it is WAL/CRC

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com
Cc: Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Checkpoint cost, looks like it is WAL/CRC
Date: 2005-07-06 20:48:44
Message-ID: 1120682924.3940.172.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 2005-06-29 at 23:23 -0400, Tom Lane wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
> >> Uh, what exactly did you cut out? I suggested dropping the dumping of
> >> full page images, but not removing CRCs altogether ...
>
> > Attached is the patch I used.
>
> OK, thanks for the clarification. So it does seem that dumping full
> page images is a pretty big hit these days.

Yes the performance results are fairly damning. That's a shame, I
convinced myself that the CRC32 and block-hole compression was enough.

The 50% performance gain isn't the main thing for me. The 10 sec drop in
response time immediately after checkpoint is the real issue. Most sites
are looking for good response as an imperative, rather than throughput.

Overall, IMHO we must do something about this for 8.1. Possibly
something more for 8.2 also, but definitely *something* now.

> (In defense of the original
> idea, I believe it was not such a hit at the time --- but as we continue
> to improve performance, things that weren't originally at the top of the
> profile become significant.)

No defense required. As you say, it was the best idea at the time.

> It seems like we have two basic alternatives:
>
> 1. Offer a GUC to turn off full-page-image dumping, which you'd use only
> if you really trust your hardware :-(
>
> 2. Think of a better defense against partial-page writes.
>
> I like #2, or would if I could think of a better defense. Ideas anyone?

Well, I'm all for #2 if we can think of one that will work. I can't.

Option #1 seems like the way forward, but I don't think it is
sufficiently safe just to have the option to turn things off.

With wal_changed_pages= off *any* crash would possibly require an
archive recovery, or a replication rebuild. It's good that we now have
PITR, but we do also have other options for availability. Users of
replication could well be amongst the first to try out this option.

The problem is that you just wouldn't *know* whether the possibly was
yes or no. The temptation would be to assume "no" and just continue,
which could lead to data loss. And that would lead to a lack of trust in
PostgreSQL and eventual reputational loss. Would I do an archive
recovery, or would I trust that RAID array had written everything
properly? With an irate Web Site Manager saying "you think? it might?
maybe? You mean you don't know???"

If we pick option #1, it *must* also include a way of deciding whether a
crash recovery has succeeded, or not. Other commentators have mentioned
the torn_pages option for other DBMS. It seems we also need an
indicator. That should be a non-optional feature of 8.1 and then perhaps
an optional feature in later releases when we have more experience to
say that turning it off is OK in some circumstances.

We could implement the torn-pages option, but that seems a lot of work.
Another way of implementing a tell-tale would be to append the LSN again
as a data page trailer as the last 4 bytes of the page. Thus the LSN
would be both the first and last item on the data page. Any partial
write would fail to update the LSN trailer and we would be able to see
that a page was torn. That's considerably easier than trying to write a
torn page tell-tale to each 512 byte sector of a page as SQLServer does.

During recovery, if a full page image is not available, we would read
the page from the database and check that the first and last LSNs match.
If they do, then the page is not torn and recovery can be successful. If
they do not match, then we attempt to continue recovery, but issue a
warning that torn page has been detected and a full archive recovery is
recommended. It is likely that the recovery itself will fail almost
immediately following this, since changes will try to be made to a page
in the wrong state to receive it, but there's no harm in trying....

Like this specific idea or not, I'm saying that we need a tell-tale: a
way of knowing whether we have a torn page, or not. That way we can
safely continue to rely upon crash recovery.

Tom, I think you're the only person that could or would be trusted to
make such a change. Even past the 8.1 freeze, I say we need to do
something now on this issue.

Best Regards, Simon Riggs

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2005-07-06 21:06:46 Re: timezone changes break windows and cygwin
Previous Message Darren Alcorn 2005-07-06 20:28:35 SQL99 - Nested Tables