Re: Duplicate values found when reindexing unique index

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Mason Hale" <masonhale(at)gmail(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Duplicate values found when reindexing unique index
Date: 2007-12-31 22:16:19
Message-ID: 9866.1199139379@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

"Mason Hale" <masonhale(at)gmail(dot)com> writes:
>> If you'd send them to me privately, I'd be interested.

> I will send these to you right away.

Okay, the 00058 file looks pretty sane, except for what we already knew
about the first page being overwritten with a much later shutdown
checkpoint. All the other page headers are as-expected.

The 00059 file, on the other hand, is just plain wacko. Here are its
first few page headers (or what should be page headers):

0000000 d05e 0001 0001 0000 006b 0000 6000 69dc
0020000 d05e 0001 0001 0000 006b 0000 8000 69dc
0040000 d05e 0001 0001 0000 006b 0000 a000 69dc
0060000 d05e 0001 0001 0000 006b 0000 c000 69dc
0100000 d05e 0001 0001 0000 006b 0000 e000 69dc
0120000 d05e 0001 0001 0000 006b 0000 0000 69dd
0140000 0018 93cc 0018 9414 0018 9588 0018 957c
0160000 000c d63c 0bc7 0009 3c17 0009 000c d63c
0200000 000c b55e 08f7 0009 1373 004b 000c b55e
0220000 2f11 0042 000c 6cc3 0bac 0009 2e14 0043
0240000 4001 0008 c775 0060 0010 d61c 10c3 0c40
0260000 0901 0009 3719 001c 000c 26fa 0901 0009
0300000 0b95 0008 ab38 004b 000c 40b4 0b95 0008
0320000 4001 74de 0002 74e0 0002 0000 0000 0001
0340000 0018 9c48 0018 9c3c 0018 9c30 0018 9c24
0360000 b715 0026 0010 89a1 97e3 0549 4001 0002
0400000 0905 0009 2fc1 0031 000c a3aa 0905 0009
0420000 317c 0015 000c 7017 0bbd 0009 315f 0038
0440000 0018 960c 0018 9600 0018 95f4 0018 95e8
0460000 0018 8f64 0018 8f58 0018 8f4c 0018 8f40

The first six look sane except that they shouldn't be in this file,
and in fact shouldn't be at the front of any WAL file. After that it
goes into a long stretch of what seems to be just plain garbage data.
Here's a sample chosen at random:

0058daa0: 432b 63b4 cc3f 1200 6525 3e00 1000 81d2 C+c..?..e%>.....
0058dab0: 432b 63b4 cc3f 1200 2a16 0600 1000 39c8 C+c..?..*.....9.
0058dac0: 7a28 3eb4 cc3f 1200 7e14 1600 1000 39c8 z(>..?..~.....9.
0058dad0: 7a28 3eb4 cc3f 1200 7e14 1500 1000 39c8 z(>..?..~.....9.
0058dae0: 7a28 3eb4 cc3f 1200 7e14 1400 1000 39c8 z(>..?..~.....9.
0058daf0: 7a28 3eb4 cc3f 1200 7e14 1300 1000 39c8 z(>..?..~.....9.
0058db00: 7a28 3eb4 cc3f 1200 7e14 1200 1000 39c8 z(>..?..~.....9.
0058db10: 7a28 3eb4 cc3f 1200 2c09 4600 1000 f9a1 z(>..?..,.F.....
0058db20: 8b0a 7cb4 cc3f 1100 73e0 4c00 1000 b0ca ..|..?..s.L.....
0058db30: d011 4ab4 cc3f 1100 06c2 0700 1000 f777 ..J..?.........w
0058db40: 1a05 aab4 cc3f 1100 dfaa 4600 1000 82df .....?....F.....
0058db50: 14ac 91b4 cc3f 1100 07a4 0f00 1000 0df4 .....?..........
0058db60: f025 8cb4 cc3f 1100 19a3 4f00 1000 0df4 .%...?....O.....
0058db70: f025 8cb4 cc3f 1100 19a3 4e00 1000 0df4 .%...?....N.....
0058db80: f025 8cb4 cc3f 1100 19a3 4d00 1000 0df4 .%...?....M.....
0058db90: f025 8cb4 cc3f 1100 19a3 4c00 1000 0df4 .%...?....L.....
0058dba0: f025 8cb4 cc3f 1100 19a3 4b00 1000 0df4 .%...?....K.....
0058dbb0: f025 8cb4 cc3f 1100 069f 1e00 1000 8fc1 .%...?..........
0058dbc0: bcc4 9bb4 cc3f 1100 ae9d 0300 1000 8fc1 .....?..........
0058dbd0: bcc4 9bb4 cc3f 1100 8d82 4900 1000 5e2d .....?....I...^-
0058dbe0: 7dc4 79b4 cc3f 1100 8c82 2800 1000 5e2d }.y..?....(...^-
0058dbf0: 7dc4 79b4 cc3f 1100 8c82 2700 1000 5e2d }.y..?....'...^-
0058dc00: 7dc4 79b4 cc3f 1100 8c82 2600 1000 5e2d }.y..?....&...^-
0058dc10: 7dc4 79b4 cc3f 1100 8c82 2500 1000 5e2d }.y..?....%...^-
0058dc20: 7dc4 79b4 cc3f 1100 007d 4100 1000 5e2d }.y..?...}A...^-
0058dc30: 7dc4 79b4 cc3f 1100 e27c 4200 1000 5e2d }.y..?...|B...^-
0058dc40: 7dc4 79b4 cc3f 1100 cc7c 0300 1000 e183 }.y..?...|......
0058dc50: 2610 91b4 cc3f 1100 2f6f 1200 1000 9473 &....?../o.....s
0058dc60: d989 8ab4 cc3f 1100 6c33 4800 1000 2e08 .....?..l3H.....
0058dc70: 2178 7db4 cc3f 1100 312d 4100 1000 006f !x}..?..1-A....o

There's a clear 16-byte periodicity to this junk, but it doesn't
look much like anything I'd expect to find in Postgres data.
What it *does* remind me of is certain types of hardware failure
modes, eg a 16-byte-wide DMA path going nuts. Or it could be valid
data from some non-PG application, though I dunno what offhand.

Continuing to dig through the places where WAL page headers ought to be,
there are additional long stretches of seeming sanity, eg here

11060000 d05e 0001 0001 0000 006b 0000 c000 6a00
11100000 d05e 0001 0001 0000 006b 0000 e000 6a00
11120000 d05e 0001 0001 0000 006b 0000 0000 6a01
11140000 d05e 0001 0001 0000 006b 0000 2000 6a01
11160000 d05e 0001 0001 0000 006b 0000 4000 6a01
11200000 d05e 0001 0001 0000 006b 0000 6000 6a01
11220000 d05e 0000 0001 0000 006b 0000 8000 6a01
11240000 d05e 0000 0001 0000 006b 0000 a000 6a01
11260000 d05e 0000 0001 0000 006b 0000 c000 6a01

and here's an interesting stretch

14260000 d05e 0001 0001 0000 006b 0000 c000 6a0d
14300000 d05e 0000 0001 0000 006b 0000 e000 6a0d
14320000 d05e 0001 0001 0000 006b 0000 0000 6a0e
14340000 d05e 0001 0001 0000 006b 0000 2000 6a0e
14360000 d05e 0000 0001 0000 0067 0000 4000 610e
14400000 d05e 0001 0001 0000 0067 0000 6000 610e
14420000 d05e 0001 0001 0000 0067 0000 8000 610e
14440000 d05e 0001 0001 0000 0067 0000 a000 610e
14460000 d05e 0001 0001 0000 0067 0000 c000 610e
14500000 d05e 0001 0001 0000 0067 0000 e000 610e
14520000 d05e 0000 0001 0000 0067 0000 0000 610f
14540000 d05e 0001 0001 0000 0067 0000 2000 610f
14560000 d05e 0001 0001 0000 0067 0000 4000 610f
14600000 d05e 0001 0001 0000 0067 0000 6000 610f
14620000 d05e 0001 0001 0000 0067 0000 8000 610f
14640000 d05e 0001 0001 0000 0067 0000 a000 610f
14660000 d05e 0001 0001 0000 0067 0000 c000 610f
14700000 d05e 0001 0001 0000 0067 0000 e000 610f
14720000 d05e 0001 0001 0000 0065 0000 0000 5910
14740000 d05e 0000 0001 0000 0065 0000 2000 5910
14760000 d05e 0001 0001 0000 0065 0000 4000 5910
15000000 d05e 0001 0001 0000 0065 0000 6000 5910
15020000 d05e 0001 0001 0000 0061 0000 8000 4b10
15040000 d05e 0001 0001 0000 0061 0000 a000 4b10
15060000 d05e 0001 0001 0000 0061 0000 c000 4b10
15100000 d05e 0001 0001 0000 0061 0000 e000 4b10
15120000 d05e 0001 0001 0000 0061 0000 0000 4b11

and ending up here

77320000 d05e 0001 0001 0000 0043 0000 0000 a1da
77340000 d05e 0001 0001 0000 0043 0000 2000 a1da
77360000 d05e 0001 0001 0000 0043 0000 4000 a1da
77400000 d05e 0001 0001 0000 0043 0000 6000 a1da
77420000 d05e 0001 0001 0000 0043 0000 8000 a1da
77440000 d05e 0001 0001 0000 0043 0000 a000 a1da
77460000 d05e 0001 0001 0000 0043 0000 c000 a1da
77500000 d05e 0000 0001 0000 0043 0000 e000 a1da
77520000 d05e 0001 0001 0000 0043 0000 0000 a1db
77540000 d05e 0001 0001 0000 0043 0000 2000 a1db
77560000 d05e 0001 0001 0000 0043 0000 4000 a1db
77600000 d05e 0001 0001 0000 0043 0000 6000 a1db
77620000 d05e 0001 0001 0000 0043 0000 8000 a1db
77640000 d05e 0000 0001 0000 0043 0000 a000 a1db
77660000 d05e 0001 0001 0000 0043 0000 c000 a1db
77700000 d05e 0001 0001 0000 0043 0000 e000 a1db
77720000 d05e 0001 0001 0000 0043 0000 0000 a1dc
77740000 d05e 0001 0001 0000 0043 0000 2000 a1dc
77760000 d05e 0000 0001 0000 0043 0000 4000 a1dc

What we've got here is chunks of stuff that look like they are
WAL data, all right, but they are in the wrong positions in the
wrong files.

This could be the kernel's fault, but I'm wondering whether the
RAID controller is going south. Offhand it seems like the controller
would be the only place that would be likely to explain both the
bogus-data and good-data-in-wrong-place behaviors.

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Mason Hale 2007-12-31 22:51:02 Re: Duplicate values found when reindexing unique index
Previous Message Mason Hale 2007-12-31 20:27:54 Re: Duplicate values found when reindexing unique index