Re: repeatable system index corruption on 7.4.2 (SOLVED)

From: Joe Conway <mail(at)joeconway(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>,"Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: repeatable system index corruption on 7.4.2 (SOLVED)
Date: 2004-08-21 11:32:09
Message-ID:
Lists: pgsql-hackers
Joe Conway wrote:
> Simon Riggs wrote:
>>> Joe Conway writes
>>> I'm seeing the following errors after a few hours of fairly aggressive
>>> bulk load of a database running on Postgres 7.4.2:
>>> When I say aggressive, I mean up to 6 simultaneous COPY processes. It is
>>> different from the issue Tom solved the other day in that we don't get
>>> SIGABORT, just corrupt index pages.
>> OK, problem accepted, but why would you run 6 simultaneous COPYs? 
>> Presumably
>> on > 1 CPU? Sounds like you're hitting the right edge of the index really
>> hard (as well as finding a hole in the logic).
> This is fairly high end hardware -- 4 hyperthreaded CPUs (hence 8 CPUs 
> from the OS perspective), 8GB RAM.
> But in any case, since last report we've reproduced the problem with a 
> single COPY at a time.

I just want to close the loop on this thread. In summary, the problem 
turned out to be related to the logical volume, and NOT Postgres. For 
those interested in the gory detail, read on -- others may safely move 
on ;-)


Here's what we had, in terms of storage layout (i.e. the layout 
susceptible to Postgres system catalog corruption). Note that I'm 
neither the unix admin nor the storage expert -- we had our own unix 
admin, a storage expert from the local VAR, and a storage expert from 
the SAN vendor involved. They decided how to lay this all out -- I'll do 
my best to be accurate in my description.

[-------------     jfs filesystem   -------------]
[-------------     logical volume   -------------]
[-------------      volume group    -------------]
| LUN1 | LUN2 | LUN3 | LUN4 | LUN5 | LUN6 | LUN7 |

Each of LUN[1-7] is actually composed of 14 x 73GB x 15K rpm SCSI 
drives, configured (I think) as a RAID 5 array, totaling just under 1TB 
of usable space. The SAN presents each of these arrays, via fibrechannel,
to the OS as a single, large SCSI LUN. The LUNs are individually 
partitioned using fdisk, a single primary partition on each, and in this 
case the partition was offset 128MB to allow for stripe alignment (more 
on stripes later).
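The partitioning step might be sketched roughly as below. This is a 
hedged reconstruction, not the actual commands used: the device names 
(/dev/sdb through /dev/sdh) are assumptions, and it uses scriptable 
parted rather than the interactive fdisk mentioned above.

```shell
# Hypothetical sketch of the per-LUN partitioning described above.
# Device names are assumed; the original work used fdisk interactively.
for dev in /dev/sd[b-h]; do
    # One primary partition per LUN, starting 128MiB into the disk so
    # that the partition lines up with the 128MB stripe boundary.
    parted --script "$dev" \
        mklabel msdos \
        mkpart primary 128MiB 100%
done
```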

The volume group creates a block device. We used 128MB physical extent 
size to allow for up to an 8TB logical volume.
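With classic LVM the extent count per volume was capped at 65,536, so a 
128MB physical extent size is what allows a logical volume of up to 
65,536 x 128MB = 8TB. A sketch of the volume group creation, with 
assumed device and group names:

```shell
# Hypothetical sketch: initialize each LUN partition as an LVM physical
# volume, then build a volume group with 128MB physical extents
# (65,536 extents x 128MB = 8TB maximum logical volume).
pvcreate /dev/sd[b-h]1
vgcreate -s 128M vg_data /dev/sd[b-h]1
```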

The logical volume is then created on top of the volume group. We 
created a single, 6.4TB logical volume using 128MB stripes across the 
LUNs in the volume group.
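The striped logical volume step might look like the following. This uses 
modern LVM2 syntax (the 2004-era tooling differed), and the volume and 
group names are assumptions:

```shell
# Hypothetical sketch: one 6.4TB logical volume striped across all
# seven LUNs in the volume group, with a 128MB stripe size.
lvcreate --stripes 7 --stripesize 128M --size 6.4T \
    --name lv_pgdata vg_data
```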

Finally the filesystem is laid on top. We started out with jfs based on
vendor recommendations (I believe). I'll represent this topology a bit 
more compactly as (SAN-SCSI->SLV->jfs), where SLV stands for striped 
logical volume.
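The final layer might be sketched as follows; mount point and device 
path are assumptions:

```shell
# Hypothetical sketch: format the logical volume and mount it.
mkfs.jfs -q /dev/vg_data/lv_pgdata      # initial attempt (jfs)
# mkfs.xfs -f /dev/vg_data/lv_pgdata   # later retest with xfs
mount /dev/vg_data/lv_pgdata /data1
```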


In the above configuration we consistently got data corruption (as 
evidenced by bad system indexes) regardless of the Postgres 
configuration, and with 1 or 6 parallel data loads. We then changed the 
filesystem to xfs. With an xfs filesystem, we got substantially farther 
into our data load, but corruption did eventually occur.

After a variety of successful data loads (i.e. no corruption) on other 
volumes, with different topologies, we decided that the problem was 
related to either the logical volume level, or the SAN hardware itself.

So we deleted the logical volume and the volume group. Then we 
re-partitioned the LUNs -- this time without the offset for stripe 
alignment. The volume group was rebuilt using 128MB extents again, but 
the logical volume was built using concatenated, instead of striped,
LUNs. Finally we formatted with xfs. That is, (SAN-SCSI->CLV->xfs), 
where CLV is concatenated logical volume.
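The rebuilt layout might be sketched as below, under the same assumed 
names as before. The key differences from the failing configuration are 
no alignment offset at partition time and a linear (concatenated) 
allocation policy, which is lvcreate's default when no stripe options 
are given:

```shell
# Hypothetical sketch of the working (SAN-SCSI->CLV->xfs) rebuild:
# same 128MB extents, but no --stripes/--stripesize, so the LUNs are
# concatenated rather than striped.
vgcreate -s 128M vg_data /dev/sd[b-h]1
lvcreate --size 6.4T --name lv_pgdata vg_data   # linear by default
mkfs.xfs /dev/vg_data/lv_pgdata
```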

(SAN-SCSI->CLV->xfs) worked with no data corruption. We're not entirely 
sure why the original configuration was a problem, but the SAN vendor 
has agreed to try to reproduce this scenario in their lab.


Finally, some comparative times on the various volume types that we
tested while troubleshooting:

fail == system catalog data corruption
pass == no evident corruption

data1 -> (SAN-SCSI->SLV->jfs)  -> fail
data1 -> (SAN-SCSI->SLV->xfs)  -> fail
data3 -> (NFS-mounted-NAS)     -> pass (122 minutes)
data2 -> (SAN-IDE->CLV->jfs)   -> pass (103 minutes)
data1 -> (SAN-SCSI->CLV->xfs)  -> pass (94 minutes)
data1 -> (SAN-SCSI->CLV->xfs)  -> pass (93 minutes)

Times listed are the total clock time to complete the entire data load, 
using 6 parallel processes doing bulk COPY.
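A load driven by 6 parallel COPY processes might look roughly like 
this; the database, table, and file names are assumptions, not taken 
from the thread:

```shell
# Hypothetical sketch of the benchmark workload: six bulk loads run
# concurrently via psql's client-side \copy.
for i in 1 2 3 4 5 6; do
    psql -d bench -c "\\copy load_table$i FROM 'chunk$i.dat'" &
done
wait   # clock time reported above = time for the slowest loader
```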

Hope someone finds this useful.

