Re: repeatable system index corruption on 7.4.2 (SOLVED)

From: Joe Conway <mail(at)joeconway(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: repeatable system index corruption on 7.4.2 (SOLVED)
Date: 2004-08-21 11:32:09
Message-ID: 412732B9.1030706@joeconway.com
Lists: pgsql-hackers

Joe Conway wrote:
> Simon Riggs wrote:
>>> Joe Conway writes
>>> I'm seeing the following errors after a few hours of fairly aggressive
>>> bulk load of a database running on Postgres 7.4.2:
>>
>>> When I say aggressive, I mean up to 6 simultaneous COPY processes. It is
>>> different from the issue Tom solved the other day in that we don't get
>>> SIGABORT, just corrupt index pages.
>>
>> OK, problem accepted, but why would you run 6 simultaneous COPYs?
>> Presumably
>> on > 1 CPU? Sounds like you're hitting the right edge of the index really
>> hard (as well as finding a hole in the logic).
>
> This is fairly high end hardware -- 4 hyperthreaded CPUs (hence 8 CPUs
> from the OS perspective), 8GB RAM.
>
> But in any case, since last report we've reproduced the problem with a
> single COPY at a time.

I just want to close the loop on this thread. In summary, the problem
turned out to be related to the logical volume, and NOT Postgres. For
those interested in the gory detail, read on -- others may safely move
on ;-)

Joe

----------------------
Here's what we had, in terms of storage layout (i.e. the layout
susceptible to Postgres system catalog corruption). Note that I'm
neither the unix admin nor the storage expert -- we had our own unix
admin, a storage expert from the local VAR, and a storage expert from
the SAN vendor involved. They decided how to lay this all out -- I'll do
my best to be accurate in my description.

[------------- jfs filesystem -------------]
[------------- logical volume -------------]
[-------------- volume group --------------]
+------+------+------+------+------+------+------+
| LUN1 | LUN2 | LUN3 | LUN4 | LUN5 | LUN6 | LUN7 |
+------+------+------+------+------+------+------+

LUN[1-7] each actually consist of 14 x 73GB x 15K rpm SCSI drives,
configured (I think) in a RAID 5 array, totaling just under 1TB of
usable space. The SAN presents each of these arrays, via Fibre Channel,
to the OS as a single, large SCSI LUN. The LUNs are individually
partitioned using fdisk, a single primary partition on each, and in this
case the partition was offset 128MB to allow for stripe alignment (more
on stripes later).
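To illustrate the alignment arithmetic (a sketch, not our exact fdisk
invocation -- the 512-byte sector size is an assumption): a 128MB offset
keeps the partition start on a 128MB stripe boundary, which in sectors
works out to:

```shell
# 128MB partition offset expressed in 512-byte sectors, so the
# partition start lands exactly on a 128MB stripe boundary.
offset_mb=128
sector_bytes=512
offset_sectors=$(( offset_mb * 1024 * 1024 / sector_bytes ))
echo "$offset_sectors"   # 262144 sectors

# Alignment check: offset must be an exact multiple of the stripe size.
echo $(( (offset_sectors * sector_bytes) % (128 * 1024 * 1024) ))   # 0
```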

The volume group pools the LUNs' physical extents into a single block
storage pool. We used a 128MB physical extent size to allow for up to an
8TB logical volume.
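The 8TB ceiling follows from the per-LV extent limit: assuming a
65,536-extent ceiling (an assumption implied by the 8TB figure, not
stated above), the maximum LV size is extent size times extent count:

```shell
# Max logical volume size = physical extent size * max extents.
# 65536 is an assumed per-LV extent ceiling, implied by the 8TB figure.
pe_mb=128
max_extents=65536
echo "$(( pe_mb * max_extents / 1024 / 1024 )) TB"   # 8 TB
```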

The logical volume is then created on top of the volume group. We
created a single, 6.4TB logical volume using 128MB stripes across the
LUNs in the volume group.

Finally, the filesystem is laid on top. We started out with jfs based on
vendor recommendations (I believe). I'll represent this topology a bit
more compactly as (SAN-SCSI->SLV->jfs), where SLV stands for striped
logical volume.
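For readers less familiar with Linux LVM, the striped layout above
corresponds roughly to the following command sequence. This is a hedged
sketch: the device names and volume names are hypothetical, and whether
the LVM tools of that era accept a 128MB stripe size is an assumption.

```shell
# Hypothetical device names -- adjust to your environment.
pvcreate /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 \
         /dev/sdf1 /dev/sdg1 /dev/sdh1

# -s 128M: 128MB physical extent size (allows up to ~8TB per LV)
vgcreate -s 128M datavg /dev/sd[b-h]1

# -i 7: stripe across all 7 LUNs; -I takes the stripe size in KB,
# so 131072 KB = 128MB (older LVM versions may reject stripes this large)
lvcreate -i 7 -I 131072 -L 6.4T -n datalv datavg

mkfs.jfs /dev/datavg/datalv
```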

----------------------

In the above configuration we consistently got data corruption (as
evidenced by bad system indexes) regardless of the Postgres
configuration, and with 1 or 6 parallel data loads. We then changed the
filesystem to xfs. With an xfs filesystem, we got substantially farther
into our data load, but corruption did eventually occur.

After a variety of successful data loads (i.e. no corruption) on other
volumes, with different topologies, we decided that the problem was
related to either the logical volume level, or the SAN hardware itself.

So we deleted the logical volume and the volume group. Then we
re-partitioned the LUNs -- this time without the offset for stripe
alignment. The volume group was rebuilt using 128MB extents again, but
the logical volume was built using concatenated, instead of striped,
LUNs. Finally we formatted with xfs. That is, (SAN-SCSI->CLV->xfs),
where CLV is concatenated logical volume.
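The working configuration differs only in how the LV is created --
omitting the stripe options yields a linear (concatenated) mapping --
and in the filesystem. Again a sketch with assumed names:

```shell
# Same PV/VG setup as before, but the LUN partitions were re-created
# without the 128MB alignment offset.
vgcreate -s 128M datavg /dev/sd[b-h]1

# No -i/-I options: extents are allocated linearly (concatenated),
# filling one LUN before moving on to the next.
lvcreate -L 6.4T -n datalv datavg

mkfs.xfs /dev/datavg/datalv
```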

(SAN-SCSI->CLV->xfs) worked with no data corruption. We're not entirely
sure why the original configuration was a problem, but the SAN vendor
has agreed to try to reproduce this scenario in their lab.

----------------------

Finally, some comparative times on the various volume types that we
tested while troubleshooting:

fail == system catalog data corruption
pass == no evident corruption

data1 -> (SAN-SCSI->SLV->jfs) -> fail
data1 -> (SAN-SCSI->SLV->xfs) -> fail
data3 -> (NFS-mounted-NAS) -> pass (122 minutes)
data2 -> (SAN-IDE->CLV->jfs) -> pass (103 minutes)
data1 -> (SAN-SCSI->CLV->xfs) -> pass (94 minutes)
data1 -> (SAN-SCSI->CLV->xfs) -> pass (93 minutes)

Times listed are the total clock time to complete the entire data load,
using 6 parallel processes doing bulk COPY.
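For a rough comparison, the fastest working SAN configuration beat the
NFS-mounted NAS run by about a quarter (simple arithmetic on the clock
times above):

```shell
# Relative speedup of the fastest run (93 min) over the NFS run (122 min),
# as an integer percentage.
nfs=122
clv_xfs=93
echo "$(( (nfs - clv_xfs) * 100 / nfs ))% faster"   # 23% faster
```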

Hope someone finds this useful.

Joe
