Odd db lockup - investigation advice wanted

From: Marc Munro <marc(at)bloodnok(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Odd db lockup - investigation advice wanted
Date: 2005-11-07 18:09:54
Message-ID: 1131386995.28004.24.camel@bloodnok.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Last week I managed to lock-up and then crash a development database.
I'm going to try to reproduce it today and would like to know what I can
do to further investigate the problem.

I am running Linux 2.6.9-11.ELsmp X86_64 on a Quad Dual-Core Opteron

I have the following postgres RPMs installed:

postgresql-libs-7.4.7-2.RHEL4.1
postgresql-contrib-8.0.3-1PGDG
postgresql-libs-8.0.3-1PGDG
postgresql-server-8.0.3-1PGDG
postgresql-8.0.3-1PGDG

PGDATA is installed on a Netapp network storage device.

We are using slony 1.1.0 for replication.

The (provider) database locked-up after I killed a slony client process
(kill -9) on the subscriber. Psql connections would not respond to \d
and simply locked up. I was able to run a query to check for blocking
locks - this returned no rows. There was a significant test load on the
database at the time.

I stopped the database but was unable to restart it. I was unable to
kill a number of postgres processes and could not release postgres
shared memory. Having decided that the database was toast, I discovered
that I could not even delete the database files, and eventually the only
solution was a full reboot.

This hardware was destined to be put into production in the next two
weeks but this crash has shaken our confidence somewhat. Any advice on
how to further investigate this would be much appreciated

Here is an exerpt from the logs at the time of the failure:
LOG: duration: 4143.996 ms statement: execute wibble_transaction
( '2012416', '3410660', '2005-11-04 17:39:49 -0600', '20005', '3', '0',
'{22000,22011,22001,22002,22003,22004,22005,22006,22007,22008,22009,22010}', '{8,0,0,0,2,1,0,7,2,23,26,0}' )
LOG: duration: 4814.012 ms statement: execute write_wibble
( '2048847',
'{{15000,17660,0},{15001,3522,0},{15002,0,0},{15003,3851,0},{15004,0,0},{15005,0,0},{15006,0,0},{15007,0,0},{15011,0,0},{15012,0,0},{15013,0,0},{15014,0,0},{15015,0,0},{15016,0,0},{15017,0,0},{15018,0,0},{15019,0,0},{15020,0,0},{15021,0,0},{15022,0,0},{15023,0,0},{15024,0,0},{15025,0,0},{15026,0,0},{15048,0,0},{15030,0,0},{15031,0,0},{15032,0,0},{15033,0,0},{15035,0,0},{15036,0,0},{15037,0,0},{15038,0,0},{15040,0,0},{15041,1,0},{15042,0,0},{15043,0,0},{15046,0,0},{15047,0,0},{15049,1000,0},{15050,0,0},{15051,0,0},{15052,0,0}}' )
LOG: unexpected EOF on client connection
LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: received immediate shutdown request
WARNING: terminating connection because of crash of another server
process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
WARNING: terminating connection because of crash of another server
process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.

My lock checking query:

select d.datname || '.' || c.relname as object,
l.transaction as trans, l.pid, l.mode,
case when b.blocked then b.blocker else null end as blocker
from (
select w.pid as pid, h.pid as blocker, 't'::bool as blocked
from pg_locks h, pg_locks w
where h.granted
and not w.granted
and ( (h.relation = w.relation and h.database = w.database)
or h.transaction = w.transaction)
union
select h.pid, null, 'f'::bool as blocked
from pg_locks h, pg_locks w
where h.granted
and not w.granted
and ( (h.relation = w.relation and h.database = w.database)
or h.transaction = w.transaction)
) b,
pg_locks l
left outer join pg_database d
on d.oid = l.database
left outer join pg_class c
on c.oid = l.relation
where l.pid = b.pid
order by l.pid;

Thanks for any suggestions.

__
Marc

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Merlin Moncure 2005-11-07 18:20:32 Re: parameterized limit statements
Previous Message Tom Lane 2005-11-07 17:43:36 Re: parameterized limit statements