Re: Partially corrupted table

From: "Filip Hrbek" <filip(dot)hrbek(at)plz(dot)comstar(dot)cz>
To: <pgsql-bugs(at)postgreSQL(dot)org>
Subject: Re: Partially corrupted table
Date: 2006-08-30 08:18:46
Message-ID: 002701c6cc0c$ea3ba890$1e03a8c0@fhrbek
Lists: pgsql-bugs

Tom, thank you very much for your excellent and fast analysis (I mean it
seriously, I am comparing your help to IBM Informix commercial support
:-) ).

It is possible that the corruption was caused by a HW problem on the
customer's server, and that the problem then also appeared in our development
environment because the data was already corrupted. I will recommend that the
customer run some memory tests.

We have been using PostgreSQL on 14 customer servers for almost 5 years and
this is the first time it has crashed - and perhaps due to a HW problem. Great
work!

Regards
Filip Hrbek

----- Original Message -----
From: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Filip Hrbek" <filip(dot)hrbek(at)plz(dot)comstar(dot)cz>
Cc: <pgsql-bugs(at)postgreSQL(dot)org>
Sent: Wednesday, August 30, 2006 1:33 AM
Subject: Re: [BUGS] Partially corrupted table

> Well, it's a corrupt-data problem all right. The tuple that's
> causing the problem is on page 1208, item 27:
>
> Item 27 -- Length: 240 Offset: 1400 (0x0578) Flags: USED
> XMIN: 5213 CMIN: 140502 XMAX: 0 CMAX|XVAC: 0
> Block Id: 1208 linp Index: 27 Attributes: 29 Size: 28
> infomask: 0x0902 (HASVARWIDTH|XMIN_COMMITTED|XMAX_INVALID)
>
> 0578: 5d140000 d6240200 00000000 00000000 ]....$..........
> 0588: 0000b804 1b001d00 02091c00 0e000000 ................
> 0598: 02000000 42020000 23040000 6b000000 ....B...#...k...
> 05a8: 02000000 6a010000 0d000000 42020000 ....j.......B...
> 05b8: 02000000 10000000 08000000 00000400 ................
> 05c8: 08000000 00000400 0a000000 ffff0400 ................
> 05d8: 78050000 0a000000 00000200 03000000 x...............
> 05e8: 08000000 00000300 08000000 00000400 ................
> 05f8: 08000000 00000400 08000000 00000400 ................
> 0608: 08000000 00000200 08000000 00000300 ................
> 0618: 08800000 00000400 08000000 00000400 ................
>       ^^^^^^^^
> 0628: 08000000 00000400 08000000 00000200 ................
> 0638: 08000000 00000300 08000000 00000400 ................
> 0648: 08000000 00000400 18000000 494e565f ............INV_
> 0658: 41534153 5f323030 36303130 31202020 ASAS_20060101
>
> The underlined word is a field length word that evidently should contain
> 8, but contains hex 8008. This causes the tuple-data decoder to step
> way past the end of the tuple and off into never-never land. Since the
> results will depend on which shared buffer the page happens to be in and
> what happens to be at the address the step lands at, the inconsistent
> results from try to try are not so surprising.
>
> The next question is how did it get that way. In my experience a
> single-bit flip like that is most likely to be due to flaky memory,
> though bad motherboards or cables are not out of the question either.
> I'd recommend some thorough hardware testing on the original machine.
>
> It seems there's only the one bad bit; I did
>
> dwhdb=# delete from dwhdata_salemc.fct where ctid = '(1208,27)';
> DELETE 1
>
> and then was able to copy the table repeatedly without crash. I'd
> suggest doing that and then reconstructing the deleted tuple from
> the above dump.
>
> regards, tom lane
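
For anyone who wants to double-check the single-bit-flip reading of the dump
above, here is a minimal sketch. It assumes the page was written on a
little-endian machine (so the dumped bytes 08 80 00 00 decode to hex 8008) and
that the intended value is 8, as in the neighbouring length words. The
variable names are only illustrative.

```python
import struct

# The underlined word at offset 0x0618, in the byte order shown in the dump.
raw = bytes.fromhex("08800000")

# Assumption: the neighbouring fields (08000000) show the intended length, 8.
expected = 8

# Assumption: little-endian host, so decode as an unsigned 32-bit "<I".
(value,) = struct.unpack("<I", raw)

# XOR exposes the bits that differ between the stored and expected values.
diff = value ^ expected
flipped_bits = [bit for bit in range(32) if (diff >> bit) & 1]

print(hex(value))      # the stored length word
print(flipped_bits)    # positions of the differing bits (0 = least significant)
```

Under those assumptions the stored word differs from 8 in exactly one bit
position, which is consistent with the flaky-memory explanation.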
