Re: Postgres "invalid page header"

From: Hubert Fröhlich <hubert(dot)froehlich(at)bvv(dot)bayern(dot)de>
To: pgsql-general(at)postgresql(dot)org
Cc: fgp(at)phlo(dot)org, carl(dot)anderson(at)co(dot)fulton(dot)ga(dot)us
Subject: Re: Postgres "invalid page header"
Date: 2004-07-22 10:19:06
Message-ID: 40FF949A.9050506@bvv.bayern.de
Lists: pgsql-general

Hi list,
hello Carl, Florian,

maybe you remember my posting from 2004-06-18 and your reactions. The
problem is reproducing my error, which does not happen very often, i.e.
isolating the problem ... As Tom Lane put it: "I would very much like
to see a test case that reliably produces 'invalid page header'" ...

Meanwhile, I was able to reproduce the error (and others) with the very
basic scripts enclosed below. If you are interested in the problem, feel
free to use them (no liability ...).

The basic procedure is:

1) prepCrashtest.sh sets up SQL files for creating and populating (via
COPY) a database and for doing one big update in a single transaction.
You can scale it via the numberOfRows parameter, which sets the size of
the only table (this may take a while because the rows are generated
randomly via openssl). There is just one table with character and
integer columns and some indexes. No special datatypes, just very basic
SQL.

2) multiCrashtest.sh starts working on several databases at the same
time, on each db:
a) creating, populating, dumping, and vacuuming the db and performing
a "big" update operation (via singleCrashtest.sh),
b) all of that in several cycles (5 to 10 seems absolutely enough).

You can scale this by varying the number of databases accessed
simultaneously; a sample invocation is shown below.
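
For example, the combination that reliably crashes the Athlon client
described below (assuming the scripts are executable and on your PATH,
since multiCrashtest.sh calls singleCrashtest.sh by name; 5 cycles is
just a pick from the 5-to-10 range above):

./prepCrashtest.sh 300000
./multiCrashtest.sh 3 localhost 5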

E.g., on a client (Athlon 1400+, 512 MB RAM, XFS, SuSE Linux 8.1 with a
2.4.19 kernel) using numberOfRows=300000 and ndatabases=3, I regularly
get crashes like

ERROR: could not access status of transaction 1497515332
ERROR: could not open segment 1 of relation "idx_vmpnfl_obnr" (...
ERROR: invalid page header in block 25190 of relation "vmpnfl"
PANIC: right sibling's left-link doesn't match
ERROR: could not access status of transaction 188219392

(even PANIC!)
The same holds for an even smaller notebook.
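
To inspect a damaged block, one could dump it straight from the
relation file (just an illustration, not part of the scripts; <dboid>
and <relfilenode> are placeholders to look up in pg_database and
pg_class, and 8192 is the default page size; best done on a file copy):

# e.g. for "invalid page header in block 25190 of relation vmpnfl"
dd if=$PGDATA/base/<dboid>/<relfilenode> bs=8192 skip=25190 count=1 | od -x | head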

On my big box (see below) I'm still testing with larger values. But as
you can see, the error can be reproduced on a smaller scale with a
smaller box.

The postgresql.conf settings are fairly standard in all cases, and I
don't believe I can influence the behavior by using different settings.
I am not too familiar with the Postgres internals, let alone the C
code, but I guess that there might be some buffer overflow?!
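
For comparing runs across boxes, the settings most often implicated in
such reports can be read back like this (just an illustration, not part
of the scripts):

psql -h $server template1 -c "SELECT version();"
psql -h $server template1 -c "SHOW fsync;"
psql -h $server template1 -c "SHOW shared_buffers;"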

As the hardware used for testing was very different, I meanwhile don't
believe this is a hardware issue; but this is just a "feeling", not an
expert's opinion.

The crashes seem to happen only when *** several *** databases are
accessed simultaneously on the same box - and only when there is heavy
load.

The three scripts are enclosed below in plaintext to avoid cutting off
attachments (sorry for the bad line formatting). The scripts may be
VERY BASIC, but I hope they can shed some light on the issue.

<snip>
#!/bin/bash
# prepCrashtest.sh
# prepare a SQL file containing a COPY with $1 randomly generated rows
# ---------------------------------------------------------------------
#

THIS=`basename $0`
if [ $# != 1 ]
then
echo "usage: $THIS numberOfRows"
exit 1
fi
nrows=$1

make_copy_row()
# ----------------------------------------------------------------------
# randomly generate one row for the COPY command
{
obnr=`openssl rand -base64 20 | cut -c1-16`
pkn=`openssl rand -base64 25 | cut -c1-30`
echo -e "$obnr\t0\t2147483647\t10\t$pkn\t\\N\t0"
}

make_copy_nrows()
{
# ----------------------------------------------------------------------
# create a COPY command for a table with $nrows randomly generated rows
echo "COPY vmpnfl (obnr, edat, udat, vma, pkn, dpp, orient) FROM stdin;"
n=0
while [[ $n -lt $nrows ]]
do
make_copy_row
n=`expr $n + 1`
done
echo "\."
}

echo "...create SQL CREATE TABLE file"

echo "CREATE TABLE vmpnfl (obnr character(16), edat integer DEFAULT 0,
udat integer DEFAULT 2147483647, vma smallint, pkn character(30), dpp
smallint, orient double precision DEFAULT 0 );">crashSetup.sql
echo "CREATE INDEX idx_vmpnfl_obnr ON vmpnfl USING btree
(obnr);">>crashSetup.sql
echo "CREATE INDEX idx_vmpnfl_udat ON vmpnfl USING btree
(udat);">>crashSetup.sql
echo "CREATE INDEX idx_vmpnfl_oid ON vmpnfl USING btree
(oid);">>crashSetup.sql

echo "...create SQL UPDATE file"

echo "begin; update vmpnfl set dpp = 1234; commit" >crashUpdate.sql

echo "...create SQL COPY file with $nrows rows ... this may take a
little while"

make_copy_nrows >crashTable.sql

echo "Finished. Try now:"
echo "multiCrashtest.sh ndatabases servername ncycles"
<snip>
#!/bin/bash
# singleCrashtest.sh
# testing a single database under heavy load
# ----------------------------------------------------------------------
# setting up a database with one table and some indexes
# populating the db
# dumping the db somewhere
# vacuuming the db
# performing a big update operation: updating one column in all rows

THIS=`basename $0`
if [ $# != 3 ]
then
echo "usage: $THIS dbname servername ncycles"
exit 1
fi

database=$1
server=$2
ncycles=$3
dumpfile=dump$database.sql

echo "=== test on database $database, server $server, $ncycles R/W
cycles: "`date +%Y%m%d%H%M%S`" ==="

cycle=0

while [[ $cycle -lt $ncycles ]]
do
cycle=`expr $cycle + 1`
echo "=== cycle $cycle: "`date +%Y%m%d%H%M%S`" ==="

dropdb -h $server $database 2>&1
createdb -h $server $database 2>&1
psql -h $server $database -f crashSetup.sql 2>&1 | grep -v CREATE # reducing debug output

psql -h $server $database -f crashTable.sql 2>&1 | grep -v INSERT # reducing debug output
sleep 3
echo "--- cycle $cycle: Setup "`date +%Y%m%d%H%M%S`" ---"

pg_dump -h $server $database >$dumpfile 2>&1
sleep 3
echo "--- cycle $cycle: Dump "`date +%Y%m%d%H%M%S`" ---"
rm $dumpfile

echo -n "Tuples: "
psql $database -h $server -P tuples_only -c "select count(*) from vmpnfl"
psql $database -h $server -c "vacuum verbose analyze" 2>&1
sleep 3
echo "--- cycle $cycle: Vacuum "`date +%Y%m%d%H%M%S`" ---"

psql -h $server $database -f crashUpdate.sql 2>&1
sleep 3
echo "--- cycle $cycle: BIG Update "`date +%Y%m%d%H%M%S`" ---"

done

<snip>
#!/bin/bash
# multiCrashtest.sh
# testing databases under heavy load:
# Reading/writing into several databases simultaneously:
# this can be looped several times
# load can be increased
# by increasing ndb = no. of databases being accessed simultaneously
# or by increasing the number of rows in prepCrashtest.sh
# ----------------------------------------------------------------------
# ndatabases = number of databases being accessed simultaneously
# servername = name of the database server
# ncycles    = number of R/W cycles per database

THIS=`basename $0`
if [ $# != 3 ]
then
echo "usage: $THIS ndatabases servername ncycles"
exit 1
fi

ndb=$1
server=$2
ncycles=$3

# ----------------------------------------------------------------------
echo "...loop over $ndb databases on server $servername: $ncycles R/W
cycles per database"

db=0
while [[ $db -lt $ndb ]]
do
# putting job to background
singleCrashtest.sh crash${db} $server $ncycles >log_crash${db} &
echo " ... process for database crash${db} started"
sleep 5
db=`expr $db + 1`
done
exit

<snip>

>
> Hubert,
>
> I too have noticed such an issue
>
> Would you be willing to try to load a large table (from a SQL file)?
> It reliably demonstrates the "invalid page header" behavior
> after completion (for me).
>
> It seems to me that while there are WAL logs being flushed, a certain
> type of failed SQL insert will cause a page to be marked wrong.
>
> C.
>
>> Hi list,
>>
>> I am working with 7.4.1 under Linux (SuSE 8.1). The server is an HP
>> ProLiant DL 380-G3, 2x Intel Pentium4-Xeon, 2.8 GHz, 4 GB memory, and
>> a RAID 5 system with ca. 500 GB disk space (XFS file system).
>>
>>
>> When doing big transactions or changes (UPDATE of several million
>> rows in one step) on a database with ca. 50 GB disk space, ca. 15
>> million entries, PostGIS, and lots of indexes, I get errors like
>>
>> ERROR: invalid page header in block 582024 of relation ...
>>
>> The error does not occur regularly. It seems to me that the error is
>> related to heavy load and heavy I/O. CPU and memory do not seem to be
>> the problem, according to my hotsanic tools.
>>
>> I have worked through
>>
>> http://archives.postgresql.org/pgsql-general/2003-11/msg01288.php
>> http://archives.postgresql.org/pgsql-admin/2003-09/msg00423.php
>> http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&threadm=20030922162322.E12708%40quartz.newn.cam.ac.uk&rnum=8&prev=/groups%3Fq%3Dpg_filedump%26hl%3Den%26lr%3D%26ie%3DUTF-8%26selm%3D20030922162322.E12708%2540quartz.newn.cam.ac.uk%26rnum%3D8
>>
>>
>>
>> which suggests to me that I have run into hardware trouble (disk?
>> memory?!) But I supposed that my server was not SO bad for this
>> database application...
>>
>> How do I fix that? (I mean - not repairing the tables or pages, but
>> getting a stable system?) Try other storage hardware?!
>>
--
-------------------------------------------------------------------------------
Dr.-Ing. Hubert Fröhlich
Bezirksfinanzdirektion München
Alexandrastr. 3, D-80538 München, GERMANY
Tel. :+49 (0)89 / 2190 - 2980
Fax :+49 (0)89 / 2190 - 2997
hubert dot froehlich at bvv dot bayern dot de
