Postgres 9.0.1, Amazon EC2/EBS, XFS, JDBC and lost connections

From: Sean Laurent <sean(at)studyblue(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections
Date: 2011-10-06 17:21:52
Message-ID: CAK=aZ=k+QSGZFCE8SX8-KbgYDJZy+-5ebmFs3aTLZEkSBb3LQw@mail.gmail.com
Lists: pgsql-general

We've been running into a particularly strange problem that I'm trying to
better understand. The super short version: when I run a backup during
periods of higher load, our application servers lose their connection to the
database and then fail to reconnect.

Here's an overview of the setup:

- PostgreSQL 9.0.1 hosted on a cc1.4xlarge Amazon EC2 instance running
CentOS 5.6
- 8 disk RAID-0 array of EBS volumes used for primary data storage
- 4 disk RAID-0 array of EBS volumes used for transaction logs
- Root partition is ext3
- RAID arrays are xfs

Backups are taken using a script that runs the following workflow:

- Tell Postgres to start a backup: SELECT pg_start_backup('RAID backup');
- Run "xfs_freeze -f" on the primary RAID array
- Tell Amazon to take snapshots of each of the EBS volumes
- Run "xfs_freeze -u" to thaw the primary RAID array
- Run "xfs_freeze -f" to freeze the transaction log RAID array
- Tell Amazon to take snapshots of each of the EBS volumes
- Run "xfs_freeze -u" to thaw the transaction log RAID array
- Tell Postgres the backup is finished: SELECT pg_stop_backup();
- Remove old WAL files
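
For reference, the workflow above can be sketched roughly as follows. This is
a minimal Python sketch under stated assumptions, not the actual backup
script: the mount points and volume IDs are placeholders, the SQL is assumed
to be issued through "psql -c", and "snapshot" is a hypothetical callable
standing in for whatever tool asks Amazon to snapshot one EBS volume. The
WAL cleanup step is left out. The try/finally blocks are a design choice of
the sketch: a filesystem should always be thawed, and pg_stop_backup() always
called, even when a snapshot request fails.

```python
import subprocess
from typing import Callable, Sequence


def snapshot_frozen(mount: str, volumes: Sequence[str],
                    run: Callable[[Sequence[str]], object],
                    snapshot: Callable[[str], object]) -> None:
    """Freeze one XFS filesystem, snapshot its volumes, then always thaw it."""
    run(["xfs_freeze", "-f", mount])
    try:
        for vol in volumes:
            # `snapshot` is hypothetical: it stands in for the EBS snapshot call.
            snapshot(vol)
    finally:
        # Thaw even if a snapshot request raised, so writes are not left blocked.
        run(["xfs_freeze", "-u", mount])


def run_backup(data_mount: str, wal_mount: str,
               data_volumes: Sequence[str], wal_volumes: Sequence[str],
               snapshot: Callable[[str], object],
               run: Callable[[Sequence[str]], object] = subprocess.check_call,
               label: str = "RAID backup") -> None:
    """Run the steps in the order listed above (WAL cleanup omitted)."""
    # Assumes psql can connect without extra arguments (e.g. via PG* env vars).
    run(["psql", "-c", "SELECT pg_start_backup('%s');" % label])
    try:
        snapshot_frozen(data_mount, data_volumes, run, snapshot)
        snapshot_frozen(wal_mount, wal_volumes, run, snapshot)
    finally:
        # End backup mode even when a freeze or snapshot step failed.
        run(["psql", "-c", "SELECT pg_stop_backup();"])
```

Passing a recording function as "run" (instead of the default
subprocess.check_call) makes it possible to dry-run the sequence and check
the ordering of the freeze, snapshot and thaw steps without touching the
database or the filesystems.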

The whole process takes roughly 7 seconds on average. The RAID arrays are
frozen for roughly 2 seconds on average.

Within a few seconds of the backup, our application servers start throwing
exceptions that indicate the database connection was closed. Meanwhile,
Postgres still shows the connections and we start seeing a really high
number (for us) of locks in the database. The application servers refuse to
recover and must be killed and restarted. Once they're killed off, the
connections actually go away and the locks disappear.

What's particularly weird is that this doesn't happen all the time. The
backups were running every hour, but we have only seen the app servers crash
5-10 times over the course of a month.

Has anyone encountered anything like this? Do any of these steps have
ramifications that I'm not considering? Especially something that might
explain the app server failure?

Thanks.

Sean Laurent
Director of Operations
StudyBlue, Inc.
