Strange errors after some DB problems

From: "Dominic J(dot) Eidson" <sauron(at)the-infinite(dot)org>
To: <pgsql-admin(at)postgresql(dot)org>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Strange errors after some DB problems
Date: 2006-01-20 22:11:51
Message-ID: Pine.LNX.4.33.0601201551510.30736-100000@morannon.the-infinite.org
Lists: pgsql-admin pgsql-general


Earlier today we experienced some problems with one of our PG
installations - running 8.0.3.

It started with the DB's write performance being fairly slow (this is how
we noticed it), and after some research, I saw several of the backend
processes growing in their memory usage, to somewhere around 4-6GB RSS.
(The machine has 8GB + 1GB swap.) They would then swap-thrash until the
kernel killed off a process, at which point I was able to issue a pg_ctl
shutdown.

After we got the machine back to where it was responsive, I found the
following errors in the log (these are all from today):

ERROR: relation with OID 97737136 does not exist
CONTEXT: SQL statement "INSERT INTO _netadmin.sl_log_1 (log_origin,
log_xid, log_tableid, log_actionseq, log_cmdtype, log_cmddata) VALUES (1,
$1, $2, nextval('_netadmin.sl_action_seq'), $3, $4);"
ERROR: xlog flush request 33/553D66E0 is not satisfied --- flushed only
to 32/FDECF4D8
CONTEXT: writing block 4945 of relation 1663/17230/96228095
ERROR: xlog flush request 33/553D66E0 is not satisfied --- flushed only
to 32/FDECF4D8
CONTEXT: writing block 4945 of relation 1663/17230/96228095
WARNING: could not write block 4945 of 1663/17230/96228095
DETAIL: Multiple failures --- write error may be permanent.

.. these occur several times. The first one seems to have been occurring
ever since we enabled Slony-I on some replication sets on the server
(_netadmin.sl* is Slony stuff). As for the latter error, I'm not sure what
would cause it.
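
For what it's worth, the two positions in the xlog flush error can be
compared mechanically. This is only a minimal sketch, assuming the "X/Y"
notation is a pair of hexadecimal numbers (log file id first, then the byte
offset within it). The function name is made up for illustration.

```python
def parse_xlog_location(text):
    """Split 'logid/offset' (both hexadecimal) into a tuple of ints."""
    log_id, offset = text.split("/")
    return int(log_id, 16), int(offset, 16)


# Values copied from the error message above.
requested = parse_xlog_location("33/553D66E0")
flushed = parse_xlog_location("32/FDECF4D8")

# Tuples compare element by element: log id first, then offset.
request_is_ahead = requested > flushed
print(requested, flushed, request_is_ahead)
```

Under that assumption the requested position is in a later log file than
anything that was flushed, which matches what the message complains about.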

At one point the following errors show up:

ERROR: could not open segment 1 of relation 1663/17230/96242110 (target
block 61997056): No such file or directory
ERROR: could not open segment 1 of relation 1663/17230/96242110 (target
block 61997056): No such file or directory
ERROR: could not open segment 1 of relation 1663/17230/96242110 (target
block 775304242): No such file or directory
ERROR: could not open segment 1 of relation 1663/17230/96242110 (target
block 1680881205): No such file or directory
ERROR: could not open segment 1 of relation 1663/17230/96242110 (target
block 1680881205): No such file or directory

.. several more lines, with different target block numbers
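
As a sanity check on those target block numbers, here is a rough sketch of
where they would fall on disk. It assumes the default 8KB block size and 1GB
segment files (neither is confirmed for this installation), and the helper
name is hypothetical.

```python
BLOCK_SIZE = 8192            # assumed: default block size in bytes
BLOCKS_PER_SEGMENT = 131072  # assumed: 1GB segment file / 8KB blocks


def locate_block(block_number):
    """Return (segment number, byte offset from the start of the relation)."""
    return block_number // BLOCKS_PER_SEGMENT, block_number * BLOCK_SIZE


# Target block numbers copied from the errors above.
for block in (61997056, 775304242, 1680881205):
    segment, byte_offset = locate_block(block)
    print(block, "-> segment", segment, "at", round(byte_offset / 2**30, 1), "GiB")
```

With those assumptions the smallest of the three, 61997056, would sit in
segment 473, i.e. 473 GiB into the relation, which is more than the whole
300GB partition. So the block numbers themselves look bogus, rather than
segment 1 simply having gone missing.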

At one point, when trying to run a vacuum on one of the tables, we got the
following errors:

2006-01-20 13:06:01 CST [local] WARNING: relation "inv_node" page 4947 is
uninitialized --- fixing
2006-01-20 13:06:01 CST [local] WARNING: relation "inv_node" page 4948 is
uninitialized --- fixing
2006-01-20 13:06:01 CST [local] WARNING: relation "inv_node" page 4949 is
uninitialized --- fixing
2006-01-20 13:06:01 CST [local] WARNING: relation "inv_node" page 4951 is
uninitialized --- fixing
2006-01-20 13:06:01 CST [local] WARNING: relation "inv_node" page 4952 is
uninitialized --- fixing

... keeps going ....

2006-01-20 13:06:01 CST [local] WARNING: relation "inv_node" page 11959
is uninitialized --- fixing
2006-01-20 13:06:01 CST [local] WARNING: relation "inv_node" page 11992
is uninitialized --- fixing
2006-01-20 13:06:01 CST [local] WARNING: relation "inv_node" page 12118
is uninitialized --- fixing
2006-01-20 13:06:04 CST [local] ERROR: failed to re-find parent key in
"inv_node_node_mac_key"

(inv_node_node_mac_key is the primary index on the inv_node table.)

When looking closer at the table (and some other tables), we found that
despite having UNIQUE indices on the tables, several of them had duplicate
keys for the index field.
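
A sketch of the kind of query that lists such duplicates. The column name
node_mac is only a guess from the index name, and the rows are made up; an
in-memory SQLite table stands in for the real one purely so the example runs
by itself. Against the real database the same GROUP BY / HAVING query would
be run from psql.

```python
import sqlite3

# Hypothetical column name, guessed from the index name inv_node_node_mac_key.
DUPLICATE_QUERY = """
    SELECT node_mac, count(*) AS copies
    FROM inv_node
    GROUP BY node_mac
    HAVING count(*) > 1
    ORDER BY node_mac
"""

conn = sqlite3.connect(":memory:")
# No UNIQUE constraint here, to imitate an index that failed to enforce it.
conn.execute("CREATE TABLE inv_node (node_mac TEXT)")
conn.executemany(
    "INSERT INTO inv_node (node_mac) VALUES (?)",
    [("00:11:22:33:44:55",), ("00:11:22:33:44:55",), ("66:77:88:99:aa:bb",)],
)

# Each result row is (key value, number of rows sharing it).
duplicates = conn.execute(DUPLICATE_QUERY).fetchall()
print(duplicates)
```

On a table whose index may itself be damaged, it is probably safer to make
sure the query reads the table and not the index, for example by turning off
enable_indexscan for the session first.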

We are currently in the process of cleaning up after the mess, but since
this is a production system, we want to try to find out what happened.

Several people online have attributed similar errors to either running out
of disk space or drive problems. The DB is on a 300GB partition, using
barely 10GB of disk space, and the server doesn't show any indications of
hardware problems.

I can provide you with the full log (616K, ~13k lines) upon request.

- d.

--
Dominic J. Eidson
"Baruk Khazad! Khazad ai-menu!" - Gimli
-------------------------------------------------------------------------------
http://www.the-infinite.org/
