Re: Recovery of Multi-stage WAL actions

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Recovery of Multi-stage WAL actions
Date: 2008-03-25 00:29:36
Message-ID: 200803250029.m2P0TaH27926@momjian.us
Lists: pgsql-hackers


Added to TODO:

> * Have resource managers report the duration of their status changes
>
> http://archives.postgresql.org/pgsql-hackers/2007-10/msg01468.php

---------------------------------------------------------------------------

Simon Riggs wrote:
> We've had two hard-to-diagnose errors in recovery in recent months. ISTM
> that the core issue is the way we allow Resource Managers to have
> multi-stage WAL actions that persist for long periods of time. This
> means we have no way of telling whether the answer
> rm_safe_restartpoint() == false is a momentary, valid state or a
> progressively worsening indicator of a subtle RM bug.
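
For reference, the per-RM hook in question is the boolean callback in the resource manager table, roughly as it appears in access/xlog_internal.h in the 8.2/8.3 sources (simplified here). All the server sees is true or false, with no indication of how long the answer has been false:

/* Simplified from the 8.2/8.3-era resource manager table entry. */
typedef struct RmgrData
{
    const char *rm_name;
    void        (*rm_redo) (XLogRecPtr lsn, XLogRecord *record);
    void        (*rm_desc) (StringInfo buf, uint8 xl_info, char *rec);
    void        (*rm_startup) (void);
    void        (*rm_cleanup) (void);
    /* false while the RM still holds incomplete multi-stage actions */
    bool        (*rm_safe_restartpoint) (void);
} RmgrData;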
>
> An example of a multi-stage WAL action would be an index split inside
> one of the Resource Managers. Now that kind of action shouldn't take
> very long, though theoretically it could for various reasons.
>
> Right now we have a log message to cope with this:
> http://archives.postgresql.org/pgsql-committers/2007-08/msg00374.php
> but it's not nearly as helpful as we'd like it to be. We could back-patch
> this to 8.2, but I have a potentially better proposal.
>
> I very much want to encourage authors of new Resource Managers and it
> looks like we may be getting at least 3 new RMs that produce WAL
> records: hash indexes (currently not WAL-logged), bitmap indexes and
> clustered indexes for 8.4. We should be realistic that new bugs probably
> will occur in recovery code for existing and new RMs.
>
> What I'd like to do is force all of the RMs to record the lsn of any WAL
> record that starts an incomplete action. Then, if an incomplete action
> lives for more than a certain period of time it will be possible to
> produce a log message saying "incomplete split has survived for X
> seconds, in xlog time". That way we'll see log messages if any of the
> RMs start to push an incomplete action onto their list and then not
> consume it again.
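
A minimal sketch of what such a recorded entry might carry; the struct name and fields here are hypothetical, not taken from any patch:

#include "postgres.h"
#include "access/rmgr.h"         /* RmgrId */
#include "access/xlogdefs.h"     /* XLogRecPtr */
#include "utils/timestamp.h"     /* TimestampTz */

/* Hypothetical bookkeeping entry for one incomplete multi-stage action. */
typedef struct IncompleteAction
{
    RmgrId      rmid;           /* resource manager that owns the action */
    XLogRecPtr  start_lsn;      /* LSN of the record that opened the action */
    TimestampTz start_time;     /* wall-clock time when it was remembered */
    Size        id_len;         /* length of the RM-private identifier */
    char        id_data[1];     /* variable-length identifier, e.g. block numbers */
} IncompleteAction;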
>
> We might trust each RM to implement code to LOG messages if their code
> goes a little awry, but I'd prefer some mechanism that allows the main
> server to check what's happening in each RM. That way we'd have a
> cross-check on whether the RM is well-behaved, plus we'd only need to
> implement the checking code once. Right now very similar, yet different
> code runs inside each RM.
>
> So my proposal is to have an incomplete split remember/forget API that
> forces each RM to expose its incomplete split List. Currently each RM
> has a hook on rm_safe_restartpoint(), so that each RM manages its own
> List. My new thought is to have one function safeRestartPoint() that
> inspects each of the incomplete split Lists to see if they are empty. If
> the lists are non-empty then inspect the age of each list entry to see
> if it is worth reporting as a possible issue. Each RM would then store
> incomplete splits using a ResourceManagerRememberIncompleteEvent(lsn,
> id_data, payload??) and ResourceManagerForgetIncompleteEvent(id_data).
> Implementation is a bit hazy on that last part, but I think the
> overall idea is clear.
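
Spelled out as signatures, purely to illustrate the shape of the proposed API (these declarations are hypothetical, not from an actual patch; the names follow the paragraph above):

/*
 * Called from an RM's redo routine when it replays the WAL record that
 * starts a multi-stage action.  id_data identifies the action so the
 * matching "forget" call can find it; payload carries whatever RM-private
 * state rm_cleanup() would need to finish the action.
 */
extern void ResourceManagerRememberIncompleteEvent(RmgrId rmid,
                                                   XLogRecPtr start_lsn,
                                                   const void *id_data, Size id_len,
                                                   const void *payload, Size payload_len);

/* Called when the WAL record that completes the action is replayed. */
extern void ResourceManagerForgetIncompleteEvent(RmgrId rmid,
                                                 const void *id_data, Size id_len);

/* One central test, replacing the per-RM rm_safe_restartpoint() hooks. */
extern bool safeRestartPoint(void);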
>
> That should mean that any incomplete split that lasts for the length of
> one restartpoint, which is *at least* one checkpoint duration, should
> cause a LOG message to be produced. We might even go as far as to ignore
> super long-lived and therefore spurious incomplete splits when we issue
> rm_cleanup() for fear of allowing RM bugs to kill recovery.
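
In that scheme the restartpoint-time check reduces to walking the central lists and complaining about anything that predates the previous restartpoint. A rough sketch, again with hypothetical names (IncompleteActions[] and LastRestartPointRedo are invented for the example, and it builds on the IncompleteAction entries sketched above):

#include "postgres.h"
#include "access/xlogdefs.h"    /* XLogRecPtr, XLByteLT */
#include "nodes/pg_list.h"      /* List, ListCell, foreach */

/* Rough sketch; RmgrTable and RM_MAX_ID are the existing rmgr definitions. */
bool
safeRestartPoint(void)
{
    bool        safe = true;
    int         rmid;

    for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
    {
        ListCell   *cell;

        foreach(cell, IncompleteActions[rmid])      /* hypothetical per-RM lists */
        {
            IncompleteAction *action = (IncompleteAction *) lfirst(cell);

            safe = false;                           /* can't take a restartpoint yet */

            /* survived a whole restartpoint interval: worth a LOG message */
            if (XLByteLT(action->start_lsn, LastRestartPointRedo))
                ereport(LOG,
                        (errmsg("incomplete action for resource manager \"%s\" "
                                "has survived since %X/%X; possible recovery bug",
                                RmgrTable[rmid].rm_name,
                                action->start_lsn.xlogid,
                                action->start_lsn.xrecoff)));
        }
    }
    return safe;
}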
>
> I'd like to suggest that those changes be performed now for 8.3 *and*
> back-patched for 8.2. I want to make sure that all users are able to
> diagnose server errors and report them. I'm guessing that might raise a
> few eyebrows, but I think it's justifiable. Bugs in complex code are
> inevitable and should not be seen to reflect badly upon RM authors.
> However, our inability to recognise RM bugs that do occur doesn't seem
> acceptable to me, especially since they may save themselves up for the
> moment of PITR fail-over. You might persuade me I'm being over-zealous
> here, but High Availability is something we have to be zealous about.
>
> It should also be possible to allow the server to stay up even if one of
> the RMs fails to recover properly. That would need to be settable, so I
> really only mean that for "optional" RMs, i.e. index RMs only. For those
> cases we should be able to mark affected indexes as corrupt. Automatic
> rebuild of corrupt indexes could also be possible,
> should it occur. That would be an 8.4 action... :-)
>
> Comments appreciated, as ever.
>
> --
> Simon Riggs
> 2ndQuadrant http://www.2ndQuadrant.com

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
