Re: Checkpoint cost, looks like it is WAL/CRC

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Checkpoint cost, looks like it is WAL/CRC
Date: 2005-07-06 22:22:38
Message-ID: 200507062222.j66MMcm04984@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Simon Riggs wrote:
> On Wed, 2005-06-29 at 23:23 -0400, Tom Lane wrote:
> > Josh Berkus <josh(at)agliodbs(dot)com> writes:
> > >> Uh, what exactly did you cut out? I suggested dropping the dumping of
> > >> full page images, but not removing CRCs altogether ...
> >
> > > Attached is the patch I used.
> >
> > OK, thanks for the clarification. So it does seem that dumping full
> > page images is a pretty big hit these days.
>
> Yes the performance results are fairly damning. That's a shame, I
> convinced myself that the CRC32 and block-hole compression was enough.
>
> The 50% performance gain isn't the main thing for me. The 10 sec drop in
> response time immediately after checkpoint is the real issue. Most sites
> are looking for good response as an imperative, rather than throughput.

Yep.

> No defense required. As you say, it was the best idea at the time.
>
> > It seems like we have two basic alternatives:
> >
> > 1. Offer a GUC to turn off full-page-image dumping, which you'd use only
> > if you really trust your hardware :-(
> >
> > 2. Think of a better defense against partial-page writes.
> >
> > I like #2, or would if I could think of a better defense. Ideas anyone?
>
> Well, I'm all for #2 if we can think of one that will work. I can't.
>
> Option #1 seems like the way forward, but I don't think it is
> sufficiently safe just to have the option to turn things off.

Well, I added #1 yesterday as 'full_page_writes', and it has the same
warnings as fsync (namely, on crash, be prepared to recovery or check
your system thoroughly.

As far as #2, my posted proposal was to write the full pages to WAL when
they are written to the file system, and not when they are first
modified in the shared buffers --- the goal being that it will even out
the load, and it will happen in a non-critical path, hopefully by the
background writer or at checkpoint time.

> With wal_changed_pages= off *any* crash would possibly require an
> archive recovery, or a replication rebuild. It's good that we now have
> PITR, but we do also have other options for availability. Users of
> replication could well be amongst the first to try out this option.

Seems it is similar to fsync in risk, which is not a new option.

> The problem is that you just wouldn't *know* whether the possibly was
> yes or no. The temptation would be to assume "no" and just continue,
> which could lead to data loss. And that would lead to a lack of trust in
> PostgreSQL and eventual reputational loss. Would I do an archive
> recovery, or would I trust that RAID array had written everything
> properly? With an irate Web Site Manager saying "you think? it might?
> maybe? You mean you don't know???"

That is a serious problem, but the same problem we have in turning off
fsync.

> During recovery, if a full page image is not available, we would read
> the page from the database and check that the first and last LSNs match.
> If they do, then the page is not torn and recovery can be successful. If
> they do not match, then we attempt to continue recovery, but issue a
> warning that torn page has been detected and a full archive recovery is
> recommended. It is likely that the recovery itself will fail almost
> immediately following this, since changes will try to be made to a page
> in the wrong state to receive it, but there's no harm in trying....

I like the idea of checking the page during recovery so we don't have to
check all the pages, just certain pages.

> Like this specific idea or not, I'm saying that we need a tell-tale: a
> way of knowing whether we have a torn page, or not. That way we can
> safely continue to rely upon crash recovery.
>
> Tom, I think you're the only person that could or would be trusted to
> make such a change. Even past the 8.1 freeze, I say we need to do
> something now on this issue.

I think if we document full_page_writes as similar to fsync in risk, we
are OK for 8.1, but if something can be done easily, it sounds good.

Now that we have a GUC we can experiment with the full page write load
and see how it can be improved.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2005-07-06 22:36:24 Re: Schedule for 8.1 feature freeze
Previous Message Bruce Momjian 2005-07-06 21:40:29 Re: timezone changes break windows and cygwin