Re: 9.0beta2 - server crash when using HS + SR

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rafael Martinez <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.0beta2 - server crash when using HS + SR
Date: 2010-06-13 16:42:49
Message-ID: AANLkTinxVuWOSS4egXx6A6bLmMrNsxuYS7kFfgKPLwkp@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jun 11, 2010 at 9:29 PM, Rafael Martinez
<r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no> wrote:
> I am testing HS + SR in a system running 9.0beta2. What I am doing is
> just trying all kinds of crazy combinations to see how the system
> handles them.

Thanks!

> One of the tests I knew was going to fail was to create a tablespace on
> the master node with the directory used by the tablespace existing on
> the master but not on the standby node.
>
> What I didn't expect was such a serious consequence. Postgres crashed in
> the standby node and it refused to start until the directory needed by
> the tablespace was created also in the standby.
>
> I suppose there is not an easy way of fixing this, but at least it would
> be a good idea to update the documentation with some information about
> how to fix this error situation (hot-standby.html#HOT-STANDBY-CAVEATS
> will be a nice place to have this information)
>
> Another thing is that the HINT message in the logs was a little
> misleading. The server is down and it will not start without fixing the
> cause of the problem.
> ----------------------------------------------------
> FATAL:  directory "/var/pgsql/ts_test" does not exist
> CONTEXT:  xlog redo create ts: 20177 "/var/pgsql/ts_test"
> LOG:  startup process (PID 10147) exited with exit code 1
> LOG:  terminating any other active server processes
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back
> the current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> HINT:  In a moment you should be able to reconnect to the database and
> repeat your command.

I think the behavior is correct (what else would we do? we must be
able to replay the subsequent WAL records that use the new
tablespace) but I agree that the hint is a little misleading.
Ideally, it seems like we'd like to issue that hint if we're planning
to restart, but not otherwise. You get that same message, for
example, if the DBA performs an immediate shutdown.
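
To make the idea concrete, here is a rough sketch of the decision in
Python. It is only an illustration, not PostgreSQL source code: the
function names and the restart policy are assumptions made up for this
example, and only the hint text is taken from the log above.

```python
from typing import Optional

# Hypothetical names throughout; this is NOT PostgreSQL source code.
# The hint text is copied from the log output quoted in this message.
RECONNECT_HINT = (
    "In a moment you should be able to reconnect to the database "
    "and repeat your command."
)


def will_restart_after_exit(startup_failed: bool, immediate_shutdown: bool) -> bool:
    """Assumed policy for this sketch: no automatic restart after the
    startup process fails (e.g. WAL redo hits a missing tablespace
    directory) or after a DBA-requested immediate shutdown."""
    return not (startup_failed or immediate_shutdown)


def crash_hint(will_restart: bool) -> Optional[str]:
    """Return the reconnect hint only when a restart is actually planned."""
    return RECONNECT_HINT if will_restart else None
```

Under these assumptions, crash_hint(will_restart_after_exit(True, False))
returns None for the missing-directory case, so the backend would stay
silent rather than promise a reconnect that cannot happen.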

I'm somewhat disinclined to try to address this for 9.0. We've had
this problem for a long time, and I'm not sure that the fact that it
can now happen in a slightly wider set of circumstances is enough
reason to engineer a solution so close to release time, nor am I sure
what that solution would look like. But I'm open to other
opinions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
