Protecting against multiple instances per cluster

From: Thom Brown <thom(at)linux(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Protecting against multiple instances per cluster
Date: 2011-09-08 18:40:44
Message-ID: CAA-aLv5G5Yb2oRafmueDXHNDEE01zs=NuzvfFZ80X=Wh3uvuyg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi all,

I've come across a PostgreSQL set up where there are 2 servers, each
with the same version of PostgreSQL on, both mounting the same SAN
onto their respective file systems. It was intended that only 1 of
the servers would be running an instance of PostgreSQL at a time as
they both point to the same pgdata. This was dubbed a "high
availability" set up, where if one server went down, they could start
PostgreSQL on the other. (yes, I know what you're thinking) Now
normally there is protection against 2 instances running only if the
instances on the same server as it would reference shared memory. But
in this case, neither server has access to the other's shared memory,
so it has to rely on the pid file. But the pid file isn't enough by
itself. In this set up, if someone were to inadvertently start up a
Postgres instance on the 2nd server whilst the 1st was still running,
it would do very bad things.

For example, when I set up the same scenario on my own network, it
indeed let me start up the 2nd instance. I then tried setting up a
table and generating lots of data for it, then... KABOOM:

postgres=# create table things (id serial, things int);
NOTICE: CREATE TABLE will create implicit sequence "things_id_seq"
for serial column "things.id"
CREATE TABLE
postgres=# insert into things (things) select x from
generate_series(1,500000) a(x);
LOG: could not link file "pg_xlog/xlogtemp.28426" to
"pg_xlog/000000010000000000000002" (initialization of log file 0,
segment 2): Operation not supported
STATEMENT: insert into things (things) select x from
generate_series(1,500000) a(x);
PANIC: could not open file "pg_xlog/000000010000000000000002" (log
file 0, segment 2): No such file or directory
STATEMENT: insert into things (things) select x from
generate_series(1,500000) a(x);
PANIC: could not open file "pg_xlog/000000010000000000000002" (log
file 0, segment 2): No such file or directory
PANIC: could not open file "pg_xlog/000000010000000000000002" (log
file 0, segment 2): No such file or directory
The connection to the server was lost. Attempting reset: LOG: server
process (PID 28426) was terminated by signal 6: Abort trap
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
Failed.
!> LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2011-09-08 19:04:47 BST
LOG: database system was not properly shut down; automatic recovery in progress
LOG: consistent recovery state reached at 0/1755268
LOG: redo starts at 0/1755268
LOG: could not open file "pg_xlog/000000010000000000000002" (log file
0, segment 2): No such file or directory
LOG: redo done at 0/1FFFFA8
LOG: last completed transaction was at log time 2011-09-08 19:05:14.098496+01
LOG: could not link file "pg_xlog/xlogtemp.28429" to
"pg_xlog/000000010000000000000002" (initialization of log file 0,
segment 2): Operation not supported
PANIC: could not open file "pg_xlog/000000010000000000000002" (log
file 0, segment 2): No such file or directory
LOG: startup process (PID 28429) was terminated by signal 6: Abort trap
LOG: aborting startup due to startup process failure

Now obviously no-one should ever set up their system in such a
fashion, but some have, as I've witnessed it. I suspect this is
potentially the cause of their continued database corruption and
outages, where expected WAL files aren't in the pg_xlog directory, so
recovery can't finish, and clog files going missing etc.

While I appreciate that this isn't a bug, and that no-one should
actually be setting things up in this way, it does introduce the
ability to hose one's own cluster without realising (until it's
possibly too late).

Would there be a way to prevent this abhorrent scenario from coming
into existence? One idea is to have a configuration option to be
strict about the presence of a pid file in the data directory, and
force manual intervention, but I'm not sure this would solve the
problem in most cases where this problem exists as someone would have
had to specifically sought out the option and set it. It might also
encourage some to just delete the pid file thinking that would make
the nasty errors go away.

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2011-09-08 19:02:32 Re: Protecting against multiple instances per cluster
Previous Message Oleg Bartunov 2011-09-08 18:11:00 Re: WIP: Fast GiST index build