Re: backend for database 'A' crashes/is killed -> corrupt index in database 'B'

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
Cc:	pgsql-bugs(at)postgresql(dot)org
Subject:	Re: backend for database 'A' crashes/is killed -> corrupt index in database 'B'
Date:	2011-03-31 07:58:55
Message-ID:	4D94343F.6030302@enterprisedb.com
Lists:	pgsql-bugs

On 30.03.2011 21:06, Jon Nelson wrote:
> The short version is that if a postgresql backend is killed (by the Linux
> OOM handler, or kill -9, etc...) while operations are
> taking place in a *different* backend, corruption is introduced in the other
> backend. I don't want to say it happens 100% of the time, but it happens
> every time I test.
>...
>
> Here is how I am reproducing the problem:
>
> 1. Open a psql connection to database A. It may remain idle.
> 2. Wait for an automated process to connect to database B and start
> operations. These operations
> 3. kill -9 the backend for the psql connection to database A.
>
> Then I observe the backends all shutting down and postgresql entering
> recovery mode, which succeeds.
> Subsequent operations on other databases appear fine, but not for
> database B: An index on one of the tables in database B is corrupted.
> It is always the
> same index.
>
> 2011-03-30 14:51:32 UTC LOG: server process (PID 3871) was terminated by
> signal 9: Killed
> 2011-03-30 14:51:32 UTC LOG: terminating any other active server
> processes
> 2011-03-30 14:51:32 UTC WARNING: terminating connection because of crash
> of another server process
> 2011-03-30 14:51:32 UTC DETAIL: The postmaster has commanded this server
> process to roll back the current transaction and exit, because another
> server process exited abnormally and possibly corrupted shared memory.
> 2011-03-30 14:51:32 UTC HINT: In a moment you should be able to reconnect
> to the database and repeat your command.
> 2011-03-30 14:51:32 UTC databaseB databaseB WARNING: terminating connection
> because of crash of another server process
> 2011-03-30 14:51:32 UTC databaseB databaseB DETAIL: The postmaster has
> commanded this server process to roll back the current transaction and exit,
> because another server process exited abnormally and possibly corrupted
> shared memory.
> 2011-03-30 14:51:32 UTC databaseB databaseB HINT: In a moment you should be
> able to reconnect to the database and repeat your command.
> 2011-03-30 14:51:32 UTC LOG: all server processes terminated;
> reinitializing
> 2011-03-30 14:51:32 UTC LOG: database system was interrupted; last known
> up at 2011-03-30 14:46:50 UTC
> 2011-03-30 14:51:32 UTC databaseB databaseB FATAL: the database system is
> in recovery mode
> 2011-03-30 14:51:32 UTC LOG: database system was not properly shut down;
> automatic recovery in progress
> 2011-03-30 14:51:32 UTC LOG: redo starts at 301/1D328E40
> 2011-03-30 14:51:33 UTC databaseB databaseB FATAL: the database system is
> in recovery mode
> 2011-03-30 14:51:34 UTC LOG: record with zero length at 301/1EA08608
> 2011-03-30 14:51:34 UTC LOG: redo done at 301/1EA08558
> 2011-03-30 14:51:34 UTC LOG: last completed transaction was at log time
> 2011-03-30 14:51:31.257997+00
> 2011-03-30 14:51:37 UTC LOG: autovacuum launcher started
> 2011-03-30 14:51:37 UTC LOG: database system is ready to accept
> connections
> 2011-03-30 14:52:05 UTC databaseB databaseB ERROR: index "<elided>"
> contains unexpected zero page at block 0
> 2011-03-30 14:52:05 UTC databaseB databaseB HINT: Please REINDEX it.
>
> What's more, I can execute a 'DELETE from tableB' (where tableB is the
> table that is the one with the troublesome index) without error, but
> when I try to *insert* that is when I get a problem. The index is a
> standard btree index. The DELETE statement has no where clause.

Can you provide a self-contained test script to reproduce this?

Is the corruption always the same, i.e. "unexpected zero page at block 0"?
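
One way to answer that across repeated test runs is to tally the error from the server log. The following is only a rough sketch, assuming the log keeps each entry on a single line (unlike the mail-wrapped excerpt quoted above) and that the message text matches the one shown in this thread. The function name `corruption_signatures` is made up for illustration and is not part of PostgreSQL.

```python
import re
from collections import Counter

# Pattern for the error text quoted in this thread. The log-line prefix
# (timestamp, user, database) is not matched and therefore ignored.
ZERO_PAGE = re.compile(
    r'ERROR:\s+index "(?P<index>[^"]*)" contains unexpected zero page '
    r'at block (?P<block>\d+)'
)


def corruption_signatures(log_text):
    """Count (index name, block number) pairs reported in a server log.

    Assumption: every log entry sits on one line; wrapped or quoted
    ("> "-prefixed) text as seen in a mail client is not handled.
    """
    return Counter(
        (match.group("index"), int(match.group("block")))
        for match in ZERO_PAGE.finditer(log_text)
    )
```

Calling `corruption_signatures(open(path).read())` on the log collected after each kill -9 run (with `path` pointing at wherever the server writes its log) gives one counter per run. If every counter has the same single key, the corruption is always the same index and block.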

> My interpretation of these values is that the drives themselves have
> their write caches disabled.

Ok. It doesn't look like a hardware issue, as there's no OS crash involved.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

backend for database 'A' crashes/is killed -> corrupt index in database 'B' at 2011-03-30 18:06:56 from Jon Nelson

Responses

Re: backend for database 'A' crashes/is killed -> corrupt index in database 'B' at 2011-03-31 11:41:32 from Jon Nelson
Re: backend for database 'A' crashes/is killed -> corrupt index in database 'B' at 2011-08-02 14:35:00 from Jon Nelson
