Re: the un-vacuumable table

From: "Andrew Hammond" <andrew(dot)george(dot)hammond(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: the un-vacuumable table
Date: 2008-06-27 23:31:05
Message-ID: 5a0a9d6f0806271631p3e7b627obc031b30f445a1ac@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 25, 2008 at 9:57 AM, Andrew Hammond
<andrew(dot)george(dot)hammond(at)gmail(dot)com> wrote:
>
> On Wed, Jun 25, 2008 at 2:58 AM, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>>
>> Andrew Hammond wrote:
>>>
>>> I found this error message in my log files repeatedly:
>>>
>>> Error: failed to re-find parent key in "ledgerdetail_2008_03_idx2" for
>>> deletion target page 64767
>>>
>>> I thought "hmm, that index looks broken. I'd better re-create it." So, I
>>> dropped the index and then tried to create a new one to replace it, which
>>> completely locked up the backend that was running the CREATE INDEX. I ran
>>> truss against the backend in question and it didn't register anything
>>> (except signals 2 and 15 when I tried to cancel the query and kill the
>>> backend respectively). I eventually had to restart the database to get the
>>> CREATE INDEX process to go away (well, to release that big nasty lock).
>>
>> What kind of an index is it? Does "SELECT COUNT(*) from <table>" work?
>
> After the restart I did a count(*) and it worked. A little under 13m rows. So, sequential scans seem to work.
>
>>>
>>> Anyway, the current plan is to drop the table and reload it from backup. I'm
>>> posting here in case there's interest in gathering some forensic data or a
>>> clever suggestion about how I can recover this situation or even some ideas
>>> about what's causing it.
>>
>> Yes, please take a filesystem-level backup right away to retain the evidence.
>
> Well, I've already burned our downtime allowance for this month, but we do a regular PITR type backup which hopefully will be sufficient to replicate the problem.
>
>>
>> Could you connect to the hung backend with gdb and get a stacktrace?
>
> The backend is no longer hung (two restarts later). I'll try to reproduce this problem on my workstation (same binary, same OS, libraries etc) using the PITR dump.
>
> Andrew
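
(For the record, next time a backend wedges like that I'll try to grab a
stack trace before restarting anything -- roughly:

gdb -p <pid>     # attach to the stuck backend (pid from ps)
(gdb) bt         # print the backtrace
(gdb) detach
(gdb) quit

truss showed nothing, so hopefully gdb would at least say where it's sitting.)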

I tried to restore the PITR backup and it failed.

Jun 27 15:54:30 qadb2 postgres[92517]: [1-1] DEBUG: postmaster:
PostmasterMain: initial environ dump:
Jun 27 15:54:30 qadb2 postgres[92517]: [2-1] DEBUG:
-----------------------------------------
Jun 27 15:54:30 qadb2 postgres[92517]: [3-1] DEBUG: USER=pgsql
Jun 27 15:54:30 qadb2 postgres[92517]: [4-1] DEBUG: MAIL=/var/mail/pgsql
Jun 27 15:54:30 qadb2 postgres[92517]: [5-1] DEBUG:
LD_LIBRARY_PATH=:/usr/local/adecn/lib
Jun 27 15:54:30 qadb2 postgres[92517]: [6-1] DEBUG: HOME=/usr/local/pgsql
Jun 27 15:54:30 qadb2 postgres[92517]: [7-1] DEBUG: PGLIB=/usr/local/lib
Jun 27 15:54:30 qadb2 postgres[92517]: [8-1] DEBUG: PS1=[QA2] \u@\h:\w\$
Jun 27 15:54:30 qadb2 postgres[92517]: [9-1] DEBUG: BLOCKSIZE=K
Jun 27 15:54:30 qadb2 postgres[92517]: [10-1] DEBUG: TERM=xterm
Jun 27 15:54:30 qadb2 postgres[92517]: [11-1] DEBUG:
PGSYSCONFDIR=/usr/local/etc/postgresql
Jun 27 15:54:30 qadb2 postgres[92517]: [12-1] DEBUG:
PGLOCALEDIR=/usr/local/share/locale
Jun 27 15:54:30 qadb2 postgres[92517]: [13-1] DEBUG:
Jun 27 15:54:30 qadb2 postgres[92517]: [13-2]
PATH=/usr/local/adecn/bin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/usr/X11R6/bin
Jun 27 15:54:30 qadb2 postgres[92517]: [13-3] :/usr/local/pgsql/bin
Jun 27 15:54:30 qadb2 postgres[92517]: [14-1] DEBUG:
ADECN_HOME=/usr/local/adecn
Jun 27 15:54:30 qadb2 postgres[92517]: [15-1] DEBUG: SHELL=/bin/sh
Jun 27 15:54:30 qadb2 postgres[92517]: [16-1] DEBUG:
Jun 27 15:54:30 qadb2 postgres[92517]: [16-2] CLASSPATH=
(deleted a bunch of lines)
Jun 27 15:54:30 qadb2 postgres[92517]: [17-1] DEBUG:
PYTHONPATH=/usr/local/adecn/python:/usr/local/adecn/lib/python:/usr/local/adecn/api/client/python
Jun 27 15:54:30 qadb2 postgres[92517]: [18-1] DEBUG: FTP_PASSIVE_MODE=YES
Jun 27 15:54:30 qadb2 postgres[92517]: [19-1] DEBUG:
PGDATA=/var/db/adecn/adecndb
Jun 27 15:54:30 qadb2 postgres[92517]: [20-1] DEBUG: LC_COLLATE=C
Jun 27 15:54:30 qadb2 postgres[92517]: [21-1] DEBUG: LC_CTYPE=C
Jun 27 15:54:30 qadb2 postgres[92517]: [22-1] DEBUG: LC_MESSAGES=C
Jun 27 15:54:30 qadb2 postgres[92517]: [23-1] DEBUG: LC_MONETARY=C
Jun 27 15:54:30 qadb2 postgres[92517]: [24-1] DEBUG: LC_NUMERIC=C
Jun 27 15:54:30 qadb2 postgres[92517]: [25-1] DEBUG: LC_TIME=C
Jun 27 15:54:30 qadb2 postgres[92517]: [26-1] DEBUG:
-----------------------------------------
Jun 27 15:54:30 qadb2 postgres[92518]: [27-1] DEBUG: invoking
IpcMemoryCreate(size=92938240)
Jun 27 15:54:30 qadb2 postgres[92518]: [28-1] DEBUG: max_safe_fds =
983, usable_fds = 1000, already_open = 7
Jun 27 15:54:30 qadb2 postgres[92519]: [29-1] LOG: database system
was interrupted at 2008-06-25 03:01:02 PDT
Jun 27 15:54:30 qadb2 postgres[92519]: [30-1] LOG: starting archive recovery
Jun 27 15:54:30 qadb2 postgres[92519]: [31-1] LOG: restore_command =
"cp -p /usr/tmp/2008-06-25_wals/%f %p"
Jun 27 15:54:30 qadb2 postgres[92519]: [32-1] DEBUG: executing
restore command "cp -p /usr/tmp/2008-06-25_wals/00000001.history
pg_xlog/RECOVERYHISTORY"
Jun 27 15:54:30 qadb2 postgres[92519]: [33-1] DEBUG: could not
restore file "00000001.history" from archive: return code 256
Jun 27 15:54:30 qadb2 postgres[92519]: [34-1] DEBUG: executing
restore command "cp -p
/usr/tmp/2008-06-25_wals/00000001000001D600000078.0055F0B8.backup
Jun 27 15:54:30 qadb2 postgres[92519]: [34-2] pg_xlog/RECOVERYHISTORY"
Jun 27 15:54:30 qadb2 postgres[92519]: [35-1] LOG: restored log file
"00000001000001D600000078.0055F0B8.backup" from archive
Jun 27 15:54:30 qadb2 postgres[92519]: [36-1] DEBUG: executing
restore command "cp -p
/usr/tmp/2008-06-25_wals/00000001000001D600000078
pg_xlog/RECOVERYXLOG"
Jun 27 15:54:31 qadb2 postgres[92519]: [37-1] LOG: restored log file
"00000001000001D600000078" from archive
Jun 27 15:54:31 qadb2 postgres[92519]: [38-1] LOG: checkpoint record
is at 1D6/7855F0B8
Jun 27 15:54:31 qadb2 postgres[92519]: [39-1] LOG: redo record is at
1D6/7855F0B8; undo record is at 0/0; shutdown FALSE
Jun 27 15:54:31 qadb2 postgres[92519]: [40-1] LOG: next transaction
ID: 397171279; next OID: 679516596
Jun 27 15:54:31 qadb2 postgres[92519]: [41-1] LOG: next MultiXactId:
857318; next MultiXactOffset: 1718141
Jun 27 15:54:31 qadb2 postgres[92519]: [42-1] LOG: automatic recovery
in progress
Jun 27 15:54:31 qadb2 postgres[92519]: [43-1] LOG: redo starts at 1D6/7855F108

(I thought this line was interesting)
Jun 27 15:54:31 qadb2 postgres[92519]: [44-1] PANIC: could not open
relation 1663/16386/679439393: No such file or directory

Jun 27 15:54:31 qadb2 postgres[92518]: [29-1] DEBUG: forked new
backend, pid=92526 socket=8
Jun 27 15:54:31 qadb2 postgres[92526]: [29-1] LOG: connection
received: host=[local]
Jun 27 15:54:31 qadb2 postgres[92526]: [30-1] FATAL: the database
system is starting up
Jun 27 15:54:31 qadb2 postgres[92526]: [31-1] DEBUG: proc_exit(0)
Jun 27 15:54:31 qadb2 postgres[92526]: [32-1] DEBUG: shmem_exit(0)
Jun 27 15:54:31 qadb2 postgres[92526]: [33-1] DEBUG: exit(0)
Jun 27 15:54:31 qadb2 postgres[92518]: [30-1] DEBUG: reaping dead processes
Jun 27 15:54:31 qadb2 postgres[92518]: [31-1] DEBUG: server process
(PID 92526) exited with exit code 0
Jun 27 15:54:31 qadb2 postgres[92518]: [32-1] DEBUG: reaping dead processes
Jun 27 15:54:31 qadb2 postgres[92518]: [33-1] LOG: startup process
(PID 92519) was terminated by signal 6
Jun 27 15:54:31 qadb2 postgres[92518]: [34-1] LOG: aborting startup
due to startup process failure
Jun 27 15:54:31 qadb2 postgres[92518]: [35-1] DEBUG: proc_exit(1)
Jun 27 15:54:31 qadb2 postgres[92518]: [36-1] DEBUG: shmem_exit(1)
Jun 27 15:54:31 qadb2 postgres[92518]: [37-1] DEBUG: exit(1)

I googled to find out what the numbers 1663/16386/679439393 from the
PANIC message mean, but no luck. The last looks like an OID.
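
My best guess is that they're tablespace OID / database OID / relfilenode,
in which case something along these lines ought to map them back to names
(untested sketch; assuming database 16386 is adecndb):

psql -d adecndb -c "SELECT spcname FROM pg_tablespace WHERE oid = 1663"      # 1663 is the pg_default tablespace
psql -d adecndb -c "SELECT datname FROM pg_database WHERE oid = 16386"
psql -d adecndb -c "SELECT relname, relkind FROM pg_class WHERE relfilenode = 679439393"   # only if a row still exists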

I also hunted through our production log files and found the following
interesting error messages.

(On Thursday night)

vacuumdb: vacuuming of database "adecndb" failed: ERROR: could not
write block 209610 of relation 1663/16386/236356665: No space left on
device

CONTEXT: writing block 209610 of relation 1663/16386/236356665

(Friday night, and every night until the incident on Tuesday).

vacuumdb: vacuuming of database "adecndb" failed: ERROR: failed to
re-find parent key in "ledgerdetail_2008_03_idx2" for deletion target
page 64767

Now, the first message is very strange since we have monitoring on the
file system used by the database and it's been hovering at about 18%
space used for the last month. So I can't figure out why we'd get "No
space left on device", assuming the device is actually the disk (which
seems reasonable given the context) and not shared memory.
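
When I'm next on the box I plan to double-check the filesystem directly,
roughly like this (path is the PGDATA from the environment dump above;
df -i because running out of inodes also reports "No space left on device"):

df -h /var/db/adecn/adecndb     # block usage -- monitoring says ~18%
df -i /var/db/adecn/adecndb     # inode usage -- exhaustion also shows up as ENOSPC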

I also checked our bug system for history and discovered we'd seen
similar problems in the past when connecting to a SAN via a FC switch.
Directly attaching the server to the SAN appeared to solve the
problem. So currently I'm wondering whether this might be a hardware / OS /
some-other-part-of-the-storage-stack issue that is trashing the database in
some way.

Interestingly, since I dropped the ledgerdetail_2008_03_idx2 index,
the nightly vacuum has been completing successfully.

I'd also like advice on whether I need to schedule another outage and
do a pg_dump / initdb / load (well, more accurately, do a slony move
to the backup box and rebuild the old origin). Or perhaps I should
ask, how urgently.

Andrew
