Re: time-delayed standbys

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: time-delayed standbys
Date: 2011-05-07 13:48:06
Message-ID: BANLkTinfrgsVK_8o9+mw6kWRcC4BxiR4jw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Apr 23, 2011 at 9:46 PM, Jaime Casanova <jaime(at)2ndquadrant(dot)com> wrote:
> On Tue, Apr 19, 2011 at 9:47 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> That is, a standby configured such that replay lags a prescribed
>> amount of time behind the master.
>>
>> This seemed easy to implement, so I did.  Patch (for 9.2, obviously) attached.
>>
>
> This crashes when stopping recovery at a target (I tried with a named
> restore point and with a point in time) after executing
> pg_xlog_replay_resume(). Here is the backtrace. I will try to check
> later, but I wanted to report it before...
>
> #0  0xb7777537 in raise () from /lib/libc.so.6
> #1  0xb777a922 in abort () from /lib/libc.so.6
> #2  0x08393a19 in errfinish (dummy=0) at elog.c:513
> #3  0x083944ba in elog_finish (elevel=22, fmt=0x83d5221 "wal receiver still active") at elog.c:1156
> #4  0x080f04cb in StartupXLOG () at xlog.c:6691
> #5  0x080f2825 in StartupProcessMain () at xlog.c:10050
> #6  0x0811468f in AuxiliaryProcessMain (argc=2, argv=0xbfa326a8) at bootstrap.c:417
> #7  0x0827c2ea in StartChildProcess (type=StartupProcess) at postmaster.c:4488
> #8  0x08280b85 in PostmasterMain (argc=3, argv=0xa4c17e8) at postmaster.c:1106
> #9  0x0821730f in main (argc=3, argv=0xa4c17e8) at main.c:199

Sorry for the slow response on this - I was on vacation for a week and
my schedule got a big hole in it.

I was able to reproduce something very like this in unpatched master,
just by letting recovery pause at a named restore point, and then
resuming it.

LOG: recovery stopping at restore point "stop", time 2011-05-07 09:28:01.652958-04
LOG: recovery has paused
HINT: Execute pg_xlog_replay_resume() to continue.
(at this point I did pg_xlog_replay_resume())
LOG: redo done at 0/5000020
PANIC: wal receiver still active
LOG: startup process (PID 38762) was terminated by signal 6: Abort trap
LOG: terminating any other active server processes

I'm thinking that this code is wrong:

    if (recoveryPauseAtTarget && standbyState == STANDBY_SNAPSHOT_READY)
    {
        SetRecoveryPause(true);
        recoveryPausesHere();
    }
    reachedStopPoint = true; /* see below */
    recoveryContinue = false;

I think the recoveryContinue = false assignment should not happen if
we decide to pause. That is, we should say if (recoveryPauseAtTarget
&& standbyState == STANDBY_SNAPSHOT_READY) { same as now } else
recoveryContinue = false.
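
To make the proposal concrete, here is a minimal standalone sketch of the
two control flows. It is a model, not a patch against xlog.c: RecoveryModel,
StandbyStateModel, the pauses counter, and both function names are
hypothetical stand-ins, and the real SetRecoveryPause() and
recoveryPausesHere() calls are reduced to incrementing a counter. Leaving
reachedStopPoint = true unconditional is also an assumption, since the
description above only says where recoveryContinue = false should move.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the standby state; only the value used in
 * the condition above is modeled faithfully. */
typedef enum
{
    STANDBY_INITIALIZED,
    STANDBY_SNAPSHOT_READY
} StandbyStateModel;

/* Hypothetical container for the variables that the snippet touches. */
typedef struct
{
    bool              recoveryPauseAtTarget;
    StandbyStateModel standbyState;
    bool              reachedStopPoint;
    bool              recoveryContinue;
    int               pauses;   /* counts simulated pause/resume cycles */
} RecoveryModel;

/* Current behavior: recoveryContinue is cleared even after a pause. */
static void
stop_point_current(RecoveryModel *r)
{
    if (r->recoveryPauseAtTarget &&
        r->standbyState == STANDBY_SNAPSHOT_READY)
    {
        /* stands in for SetRecoveryPause(true); recoveryPausesHere(); */
        r->pauses++;
    }
    r->reachedStopPoint = true;
    r->recoveryContinue = false;
}

/* Proposed behavior: clear recoveryContinue only if we did not pause. */
static void
stop_point_proposed(RecoveryModel *r)
{
    if (r->recoveryPauseAtTarget &&
        r->standbyState == STANDBY_SNAPSHOT_READY)
    {
        /* stands in for SetRecoveryPause(true); recoveryPausesHere(); */
        r->pauses++;
    }
    else
        r->recoveryContinue = false;

    /* Assumption: this assignment stays unconditional. */
    r->reachedStopPoint = true;
}
```

With recoveryPauseAtTarget set and the standby in STANDBY_SNAPSHOT_READY,
the first function leaves recoveryContinue false after the resume, while
the second leaves it true so that replay would carry on past the target.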

I haven't tested that, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
