Re: Partially corrupted table

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Filip Hrbek" <filip(dot)hrbek(at)plz(dot)comstar(dot)cz>
Cc: pgsql-bugs(at)postgreSQL(dot)org
Subject: Re: Partially corrupted table
Date: 2006-08-29 23:33:45
Message-ID: 19402.1156894425@sss.pgh.pa.us
Lists: pgsql-bugs

Well, it's a corrupt-data problem all right. The tuple that's
causing the problem is on page 1208, item 27:

Item 27 -- Length: 240 Offset: 1400 (0x0578) Flags: USED
XMIN: 5213 CMIN: 140502 XMAX: 0 CMAX|XVAC: 0
Block Id: 1208 linp Index: 27 Attributes: 29 Size: 28
infomask: 0x0902 (HASVARWIDTH|XMIN_COMMITTED|XMAX_INVALID)

0578: 5d140000 d6240200 00000000 00000000 ]....$..........
0588: 0000b804 1b001d00 02091c00 0e000000 ................
0598: 02000000 42020000 23040000 6b000000 ....B...#...k...
05a8: 02000000 6a010000 0d000000 42020000 ....j.......B...
05b8: 02000000 10000000 08000000 00000400 ................
05c8: 08000000 00000400 0a000000 ffff0400 ................
05d8: 78050000 0a000000 00000200 03000000 x...............
05e8: 08000000 00000300 08000000 00000400 ................
05f8: 08000000 00000400 08000000 00000400 ................
0608: 08000000 00000200 08000000 00000300 ................
0618: 08800000 00000400 08000000 00000400 ................
      ^^^^^^^^
0628: 08000000 00000400 08000000 00000200 ................
0638: 08000000 00000300 08000000 00000400 ................
0648: 08000000 00000400 18000000 494e565f ............INV_
0658: 41534153 5f323030 36303130 31202020 ASAS_20060101

The underlined word (at offset 0618) is a field length word that
evidently should contain 8, but contains hex 8008. This causes the
tuple-data decoder to step way past the end of the tuple and off into
never-never land. Since the results will depend on which shared buffer
the page happens to be in and what happens to be at the address the step
lands at, the inconsistent results from one try to the next are not so
surprising.
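
As a rough illustration of why the dumped bytes 08800000 read as hex
8008, here is a minimal Python sketch. It assumes the dump shows bytes in
memory order on a little-endian machine; the helper name decode_word is
hypothetical and is not part of any PostgreSQL tool. The tuple length of
240 is taken from the item header in the dump.

```python
import struct

def decode_word(hex_bytes: str) -> int:
    """Interpret 4 dumped bytes (memory order) as a little-endian 32-bit integer."""
    return struct.unpack("<I", bytes.fromhex(hex_bytes))[0]

good = decode_word("08000000")  # a neighbouring, intact length word
bad = decode_word("08800000")   # the damaged word at offset 0x0618

tuple_length = 240              # "Length: 240" from the item header

print(hex(good), hex(bad))      # 0x8 0x8008
print(bad > tuple_length)       # True: the bogus length exceeds the whole tuple
```

A length of 0x8008 (32776 bytes) is far larger than the 240-byte tuple,
which is consistent with the decoder running off the end of it.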

The next question is how did it get that way. In my experience a
single-bit flip like that is most likely to be due to flaky memory,
though bad motherboards or cables are not out of the question either.
I'd recommend some thorough hardware testing on the original machine.
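
The single-bit claim can be checked with a few lines of Python. This is
only a sketch of the arithmetic, using the two values quoted in this
message (the expected 8 and the observed hex 8008); the variable names
are illustrative.

```python
expected = 0x0008  # what the length word should contain
found = 0x8008     # what the page actually contains

diff = expected ^ found               # bits that differ between the two
flipped_bits = bin(diff).count("1")   # how many bits differ
bit_position = diff.bit_length() - 1  # index of the highest differing bit

print(hex(diff), flipped_bits, bit_position)  # 0x8000 1 15
```

Exactly one bit differs (bit 15, counting from zero), which is the
pattern one would expect from a hardware-level flip rather than from a
software bug writing a wrong value.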

It seems there's only the one bad bit; I did

dwhdb=# delete from dwhdata_salemc.fct where ctid = '(1208,27)';
DELETE 1

and then was able to copy the table repeatedly without a crash. I'd
suggest doing that and then reconstructing the deleted tuple from
the above dump.

regards, tom lane
