Re: WAL replay logic (was Re: Mount options for Ext3?)

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL replay logic (was Re: Mount options for Ext3?)
Date: 2003-02-14 14:30:30
Message-ID: 200302141430.h1EEUUr06607@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance


Is there a TODO here, like "Allow recovery from corrupt pg_control via
WAL"?

---------------------------------------------------------------------------

Kevin Brown wrote:
> Tom Lane wrote:
> > Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> > > One question I have is: in the event of a crash, why not simply replay
> > > all the transactions found in the WAL? Is the startup time of the
> > > database that badly affected if pg_control is ignored?
> >
> > Interesting thought, indeed. Since we truncate the WAL after each
> > checkpoint, seems like this approach would no more than double the time
> > for restart.
>
> Hmm...truncating the WAL after each checkpoint minimizes the amount of
> disk space eaten by the WAL, but on the other hand keeping older
> segments around buys you some safety in the event that things get
> really hosed. But your later comments make it sound like the older
> WAL segments are kept around anyway, just rotated.
>
> > The win is it'd eliminate pg_control as a single point of
> > failure. It's always bothered me that we have to update pg_control on
> > every checkpoint --- it should be a write-pretty-darn-seldom file,
> > considering how critical it is.
> >
> > I think we'd have to make some changes in the code for deleting old
> > WAL segments --- right now it's not careful to delete them in order.
> > But surely that can be coped with.
>
> Even that might not be necessary. See below.
>
> > OTOH, this might just move the locus for fatal failures out of
> > pg_control and into the OS' algorithms for writing directory updates.
> > We would have no cross-check that the set of WAL file names visible in
> > pg_xlog is sensible or aligned with the true state of the datafile
> > area.
>
> Well, what we somehow need to guarantee is that there is always WAL
> data that is older than the newest consistent data in the datafile
> area, right? Meaning that if the datafile area gets scribbled on in
> an inconsistent manner, you always have WAL data to fill in the gaps.
>
> Right now we do that by using fsync() and sync(). But I think it
> would be highly desirable to be able to more or less guarantee
> database consistency even if fsync were turned off. The price for
> that might be too high, though.
>
> > We'd have to take it on faith that we should replay the visible files
> > in their name order. This might mean we'd have to abandon the current
> > hack of recycling xlog segments by renaming them --- which would be a
> > nontrivial performance hit.
>
> It's probably a bad idea for the replay to be based on the filenames.
> Instead, it should probably be based strictly on the contents of the
> xlog segment files. Seems to me the beginning of each segment file
> should have some kind of header information that makes it clear where
> in the scheme of things it belongs. Additionally, writing some sort
> of checksum, either at the beginning or the end, might not be a bad
> idea either (doesn't have to be a strict checksum, but it needs to be
> something that's reasonably likely to catch corruption within a
> segment).
>
> Do that, and you don't have to worry about renaming xlog segments at
> all: you simply move on to the next logical segment in the list (a
> replay just reads the header info for all the segments and orders the
> list as it sees fit, and discards all segments prior to any gap it
> finds. It may be that you simply have to bail out if you find a gap,
> though). As long as the xlog segment checksum information is
> consistent with the contents of the segment and as long as its
> transactions pick up where the previous segment's left off (assuming
> it's not the first segment, of course), you can safely replay the
> transactions it contains.
>
> I presume we're recycling xlog segments in order to avoid file
> creation and unlink overhead? Otherwise you can simply create new
> segments as needed and unlink old segments as policy dictates.
>
> > Comments anyone?
> >
> > > If there exists somewhere a reasonably succinct description of the
> > > reasoning behind the current transaction management scheme (including
> > > an analysis of the pros and cons), I'd love to read it and quit
> > > bugging you. :-)
> >
> > Not that I know of. Would you care to prepare such a writeup? There
> > is a lot of material in the source-code comments, but no coherent
> > presentation.
>
> Be happy to. Just point me to any non-obvious source files.
>
> Thus far on my plate:
>
> 1. PID file locking for postmaster startup (doesn't strictly need
> to be the PID file but it may as well be, since we're already
> messing with it anyway). I'm currently looking at how to do
> the autoconf tests, since I've never developed using autoconf
> before.
>
> 2. Documenting the transaction management scheme.
>
> I was initially interested in implementing the explicit JOIN
> reordering but based on your recent comments I think you have a much
> better handle on that than I. I'll be very interested to see what you
> do, to see if it's anything close to what I figure has to happen...
>
>
> --
> Kevin Brown kevin(at)sysexperts(dot)com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2003-02-14 14:32:07 Re: Brain dump: btree collapsing
Previous Message Bruce Momjian 2003-02-14 14:29:34 Re: location of the configuration files

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2003-02-14 15:05:44 Re: Tuning scenarios (was Changing the default configuration)
Previous Message Bruce Momjian 2003-02-14 13:11:37 Re: [HACKERS] Terrible performance on wide selects