Re: WAL replay logic (was Re: [PERFORM] Mount options for Ext3?)

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL replay logic (was Re: [PERFORM] Mount options for Ext3?)
Date: 2003-01-27 20:26:27
Message-ID: 200301272026.h0RKQS329900@candle.pha.pa.us
Lists: pgsql-hackers pgsql-performance


Is there a TODO here? I like the idea of not writing pg_control, or
at least allowing it not to be read, perhaps with a pg_resetxlog flag so
we can cleanly recover from a corrupt pg_control if the WAL files
are OK.

We don't want to get rid of the WAL file rename optimization because
those are 16MB files and keeping them from checkpoint to checkpoint is
probably a win. I also like the idea of allowing something between our
"at the instant" recovery, and no recovery with fsync off. A "recover
from last checkpoint time" option would be really valuable for some.
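
For readers unfamiliar with the rename optimization, here is a rough
Python sketch of the idea only, not the backend's actual code. The
segment_name() scheme and the recycle_segment() helper are hypothetical
names made up for illustration. The point is that an obsolete segment's
already-allocated file is reused under a future name instead of being
unlinked and a fresh 16MB file created.

```python
import os


def segment_name(seqno: int) -> str:
    # Hypothetical naming scheme, for illustration only.
    return f"{seqno:016X}"


def recycle_segment(xlog_dir: str, old_seqno: int, new_seqno: int) -> str:
    """Reuse an obsolete segment's file under a future segment name."""
    old_path = os.path.join(xlog_dir, segment_name(old_seqno))
    new_path = os.path.join(xlog_dir, segment_name(new_seqno))
    # One rename replaces an unlink plus the creation and zero-fill
    # of a brand-new segment file.
    os.rename(old_path, new_path)
    return new_path
```

Note that after the rename the file still holds its old contents until
it is overwritten, which is why replay cannot trust the file name alone.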

---------------------------------------------------------------------------

Kevin Brown wrote:
> Tom Lane wrote:
> > Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> > > One question I have is: in the event of a crash, why not simply replay
> > > all the transactions found in the WAL? Is the startup time of the
> > > database that badly affected if pg_control is ignored?
> >
> > Interesting thought, indeed. Since we truncate the WAL after each
> > checkpoint, seems like this approach would no more than double the time
> > for restart.
>
> Hmm...truncating the WAL after each checkpoint minimizes the amount of
> disk space eaten by the WAL, but on the other hand keeping older
> segments around buys you some safety in the event that things get
> really hosed. But your later comments make it sound like the older
> WAL segments are kept around anyway, just rotated.
>
> > The win is it'd eliminate pg_control as a single point of
> > failure. It's always bothered me that we have to update pg_control on
> > every checkpoint --- it should be a write-pretty-darn-seldom file,
> > considering how critical it is.
> >
> > I think we'd have to make some changes in the code for deleting old
> > WAL segments --- right now it's not careful to delete them in order.
> > But surely that can be coped with.
>
> Even that might not be necessary. See below.
>
> > OTOH, this might just move the locus for fatal failures out of
> > pg_control and into the OS' algorithms for writing directory updates.
> > We would have no cross-check that the set of WAL file names visible in
> > pg_xlog is sensible or aligned with the true state of the datafile
> > area.
>
> Well, what we somehow need to guarantee is that there is always WAL
> data that is older than the newest consistent data in the datafile
> area, right? Meaning that if the datafile area gets scribbled on in
> an inconsistent manner, you always have WAL data to fill in the gaps.
>
> Right now we do that by using fsync() and sync(). But I think it
> would be highly desirable to be able to more or less guarantee
> database consistency even if fsync were turned off. The price for
> that might be too high, though.
>
> > We'd have to take it on faith that we should replay the visible files
> > in their name order. This might mean we'd have to abandon the current
> > hack of recycling xlog segments by renaming them --- which would be a
> > nontrivial performance hit.
>
> It's probably a bad idea for the replay to be based on the filenames.
> Instead, it should probably be based strictly on the contents of the
> xlog segment files. Seems to me the beginning of each segment file
> should have some kind of header information that makes it clear where
> in the scheme of things it belongs. Additionally, writing some sort
> of checksum, either at the beginning or the end, might not be a bad
> idea either (doesn't have to be a strict checksum, but it needs to be
> something that's reasonably likely to catch corruption within a
> segment).
>
> Do that, and you don't have to worry about renaming xlog segments at
> all: you simply move on to the next logical segment in the list (a
> replay just reads the header info for all the segments and orders the
> list as it sees fit, and discards all segments prior to any gap it
> finds. It may be that you simply have to bail out if you find a gap,
> though). As long as the xlog segment checksum information is
> consistent with the contents of the segment and as long as its
> transactions pick up where the previous segment's left off (assuming
> it's not the first segment, of course), you can safely replay the
> transactions it contains.
>
> I presume we're recycling xlog segments in order to avoid file
> creation and unlink overhead? Otherwise you can simply create new
> segments as needed and unlink old segments as policy dictates.
>
> > Comments anyone?
> >
> > > If there exists somewhere a reasonably succinct description of the
> > > reasoning behind the current transaction management scheme (including
> > > an analysis of the pros and cons), I'd love to read it and quit
> > > bugging you. :-)
> >
> > Not that I know of. Would you care to prepare such a writeup? There
> > is a lot of material in the source-code comments, but no coherent
> > presentation.
>
> Be happy to. Just point me to any non-obvious source files.
>
> Thus far on my plate:
>
> 1. PID file locking for postmaster startup (doesn't strictly need
> to be the PID file but it may as well be, since we're already
> messing with it anyway). I'm currently looking at how to do
> the autoconf tests, since I've never developed using autoconf
> before.
>
> 2. Documenting the transaction management scheme.
>
> I was initially interested in implementing the explicit JOIN
> reordering but based on your recent comments I think you have a much
> better handle on that than I. I'll be very interested to see what you
> do, to see if it's anything close to what I figure has to happen...
>
>
> --
> Kevin Brown kevin(at)sysexperts(dot)com
>
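
To make Kevin's header-plus-checksum proposal concrete, here is a
minimal Python sketch under stated assumptions. The header layout
(magic, sequence number, CRC-32 of the payload) and every name in it
are hypothetical and are not PostgreSQL's on-disk format. It orders
segments by what their headers say rather than by file name, drops any
segment whose checksum does not match, and keeps only the newest
gap-free run, which is the "discard all segments prior to any gap"
variant.

```python
import struct
import zlib
from typing import List, NamedTuple, Optional

# Hypothetical header: 4-byte magic, 8-byte sequence number,
# 4-byte CRC-32 of the payload that follows.
HEADER = struct.Struct("<4sQI")
MAGIC = b"XLOG"


class Segment(NamedTuple):
    seqno: int
    payload: bytes


def pack_segment(seqno: int, payload: bytes) -> bytes:
    """Build a segment image with a header describing its contents."""
    return HEADER.pack(MAGIC, seqno, zlib.crc32(payload)) + payload


def parse_segment(blob: bytes) -> Optional[Segment]:
    """Return the segment if header and checksum agree, else None."""
    if len(blob) < HEADER.size:
        return None
    magic, seqno, crc = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:]
    if magic != MAGIC or zlib.crc32(payload) != crc:
        return None
    return Segment(seqno, payload)


def replayable_segments(blobs: List[bytes]) -> List[Segment]:
    """Order valid segments by header seqno; keep the newest gap-free run."""
    valid = sorted(
        (s for s in map(parse_segment, blobs) if s is not None),
        key=lambda s: s.seqno,
    )
    run: List[Segment] = []
    for seg in valid:
        if run and seg.seqno != run[-1].seqno + 1:
            run = []  # gap found: discard everything before it
        run.append(seg)
    return run
```

A corrupt segment in the middle of the list shows up as a gap here, so
everything older than it is dropped. Bailing out with an error instead,
as Kevin also suggests, would be a one-line change at the gap check.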

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
