Re: PosgreSQL is crashing with a signal 11 - Bug?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Rafael Martinez <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: PosgreSQL is crashing with a signal 11 - Bug?
Date: 2004-09-07 22:30:34
Message-ID: 5644.1094596234@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Rafael Martinez <r(dot)m(dot)guerrero(at)usit(dot)uio(dot)no> writes:
>> Hmm, what do you get from "x/10 3054556648" ?

> (gdb) x/10 3054556648
> 0xb610d5e8: 0x2f00000c 0x00000002 0x30170000 0x020c6172
> 0xb610d5f8: 0x00000000 0x00000000 0x00ae0000 0x0006002b
> 0xb610d608: 0x2f1c0913 0x0404b70b

Well, that's certainly not a sensible first word for a numeric field;
the first word should be a length and this obviously isn't.

A reasonable theory at this point is that the data on disk for this
table have gotten corrupted, probably in the way of a bad length value
for whatever field(s) lie between the two that are being extracted here.
That could result in a miscomputed address for the next field, which
seems to be what we're looking at.

What I would suggest doing next is backtracking to find out which
physical tuple this is on which disk page, and then dumping that out
with pg_filedump (or your tool of choice) so that we can verify or
disprove the hypothesis of bad stored data. If it is bad data, we'll
want to examine the whole page anyway to see if we can see any pattern
of corruption.

You should be able to find out the physical tuple involved by looking at
the "ecxt_scantuple" field of ExecTargetList's econtext parameter. Its
"val" field should point to something like this:

(gdb) p *econtext->ecxt_scantuple->val
$3 = {t_len = 276, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 1},
t_tableOid = 863135, t_datamcxt = 0x0, t_data = 0xc2c0fc48}

t_tableOid is the source table OID, ip_blkid is the page number (divided
into high and low 16-bit halves for arcane reasons), and ip_posid is the
tuple number on that page. You can also look at *t_data for additional
confirmation of what you are dealing with:

(gdb) p *econtext->ecxt_scantuple->val->t_data
$4 = {t_choice = {t_heap = {t_xmin = 42833, t_cmin = 0, t_xmax = 863136,
t_field4 = {t_cmax = 0, t_xvac = 0}}, t_datum = {datum_len = 42833,
datum_typmod = 0, datum_typeid = 863136}}, t_ctid = {ip_blkid = {
bi_hi = 0, bi_lo = 0}, ip_posid = 1}, t_natts = 16, t_infomask = 2320,
t_hoff = 32 ' ', t_bits = ""}

I'm using CVS tip to prepare this example, so the field layout is not
the same as what you'll see in 7.4, but there will be a t_ctid field
and it will probably have the same contents as what you saw in the
scantuple struct.

Once you have the table OID, discover its file node number:

regression=# select relfilenode from pg_class where oid = 863135;
relfilenode
-------------
863135
(1 row)

(These will often be the same, but don't assume so without verifying.)
And look up your database OID:

regression=# select oid from pg_database where datname = 'mydb';

Now the file you want to look at is $PGDATA/base/dboid/relfilenode.

If you are using pg_filedump (see http://sources.redhat.com/rhdb/)
then I'd recommend a command along the lines of

pg_filedump -i -f -R pagenum $PGDATA/base/dboid/relfilenode

to dump the page in the most useful format.

We'll need to know the table schema ("\d tabname") also to interpret
what's in the dump.

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Marcio Balieiro 2004-09-07 23:56:02 PostgreSQL 8.0 - very good
Previous Message Rafael Martinez 2004-09-07 21:48:37 Re: PosgreSQL is crashing with a signal 11 - Bug?