Re: max_standby_delay considered harmful

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Florian Pflug <fgp(at)phlo(dot)org>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Bruce Momjian <bruce(at)momjian(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: max_standby_delay considered harmful
Date: 2010-05-12 19:25:14
Message-ID: 1273692314.308.1059.camel@ebony
Lists: pgsql-hackers

On Wed, 2010-05-12 at 21:10 +0200, Stefan Kaltenbrunner wrote:

> > There is no evidence to link this behaviour with HS, as yet, and you
> > should be considering the possibility the problem lies elsewhere,
> > especially since it could be code you committed that is at fault.
>
> Well I'm not sure why people seem to have that hard a time reproducing
> that issue - it seems that I can provoke it really trivially(in this
> case no loops, no pgbench, no tricks). A few minutes ago I logged into
> my test standby (which is idle except for the odd connect to template1
> caused by nagios - the master is idle as well and has been for days):

Thanks, good report.

> so it restarted two times successfully - however if one looks at the
> third time one can see that it received the smart shutdown request
> BEFORE it reached a consistent recovery state - yet it continued to
> enable HS and reenabled SR as well.
>
> The database is now sitting there doing nothing and it is more or less
> broken because you cannot connect to it in the current state:
>
> ~$ psql
> psql: FATAL: the database system is shutting down
>
> the startup process has the following backtrace:
>
> (gdb) bt
> #0 0x00007fbe24cb2c83 in select () from /lib/libc.so.6
> #1 0x00000000006e811a in pg_usleep ()
> #2 0x000000000048c333 in XLogPageRead ()
> #3 0x000000000048c967 in ReadRecord ()
> #4 0x0000000000493ab6 in StartupXLOG ()
> #5 0x0000000000495a88 in StartupProcessMain ()
> #6 0x00000000004ab25e in AuxiliaryProcessMain ()
> #7 0x00000000005d4a7d in StartChildProcess ()
> #8 0x00000000005d70c2 in PostmasterMain ()
> #9 0x000000000057d898 in main ()

Well, it's waiting for new info from the primary. That has nothing to do
with locking, but it's not an indication that it's an SR issue either. ;-)

I'll put some waits into that part of the code and see if I can induce
the failure. Maybe it's just a simple lack of a CHECK_FOR_INTERRUPTS().
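
To illustrate the kind of bug suspected here, below is a minimal
stand-alone sketch of a polling wait loop that checks for a pending
shutdown request on every iteration. It is not the actual XLogPageRead()
code: every name in it (shutdown_requested, handle_shutdown_signal,
data_available, wait_for_data) is hypothetical, and the explicit flag
test only stands in for what CHECK_FOR_INTERRUPTS() would do in the
server. Without that test, the loop would keep sleeping until data
arrives, which matches the backtrace above.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical flag, set when a shutdown request arrives. */
volatile sig_atomic_t shutdown_requested = 0;

/* Hypothetical signal handler: only records that shutdown was requested. */
void handle_shutdown_signal(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/* Hypothetical stand-in for "has new data arrived from the primary?" */
bool data_available(int polls_done, int polls_needed)
{
    return polls_done >= polls_needed;
}

/*
 * Poll for data, sleeping sleep_ms milliseconds between attempts.
 * Returns 1 if data arrived, 0 if a shutdown request ended the wait.
 */
int wait_for_data(int polls_needed, long sleep_ms)
{
    int polls_done = 0;
    struct timespec ts;

    assert(sleep_ms >= 0);
    ts.tv_sec = sleep_ms / 1000;
    ts.tv_nsec = (sleep_ms % 1000) * 1000000L;

    while (!data_available(polls_done, polls_needed))
    {
        /* The check the loop must not lack: bail out on shutdown. */
        if (shutdown_requested)
            return 0;

        nanosleep(&ts, NULL);
        polls_done++;
    }
    return 1;
}
```

If the flag test is removed, a shutdown request that arrives while no
data is coming in is never noticed, and the process stays in the sleep
loop indefinitely.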

--
Simon Riggs www.2ndQuadrant.com
