Problem with 9.1 streaming replication

From: Georges Racinet <gracinet(at)anybox(dot)fr>
To: pgsql-general(at)postgresql(dot)org
Subject: Problem with 9.1 streaming replication
Date: 2012-07-23 12:09:32
Message-ID: 500D3EFC.7050406@anybox.fr
Lists: pgsql-general

Hi all.

While testing a replication setup with PostgreSQL 9.1.4, I get an
error after promoting the slave to master: a file under the 'base'
subdirectory cannot be read, because only 0 bytes could be fetched (see
the log extract at the end). Indeed, the actual file size is 0.
I believe that, whatever configuration mistake I may have made, such
corruption should never happen, should it?
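
To see how many files are affected, the check described above (a file under 'base' whose size is 0) can be sketched in Python. This is only an illustration: the function name zero_length_files is hypothetical, and the directory passed to it is assumed to be the cluster's 'base' subdirectory. Note that a zero-length relation file is not by itself proof of corruption, since an empty table legitimately has an empty file; the result is a list of candidates to compare against the errors in the log.

```python
import os


def zero_length_files(base_dir):
    """Return the sorted paths of zero-byte regular files under base_dir.

    Hypothetical helper: base_dir is assumed to be the 'base'
    subdirectory of a PostgreSQL data directory.
    """
    hits = []
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            path = os.path.join(root, name)
            # Only regular files count; a size of 0 marks a candidate.
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                hits.append(path)
    return sorted(hits)
```

It would be called with the path of the 'base' directory of the promoted cluster, preferably while the server is stopped so that the listing does not change underneath it.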

That error persists across cluster restarts. Basically, the DB
is corrupted and almost nothing works. The only option is to rebuild it
from a dump.

Replication itself works: I am able to set it up with pg_basebackup in
both directions.

I thought for a while that the error happened because I had made the
mistake of not configuring wal_keep_segments (I didn't realize the default
value was not merely small but actually zero). Is that realistic?

Since those first attempts, I have set it to a value that I believe to be
generous (1024, which should mean 16 GB of WAL). After that, I had a
successful failover simulation.
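
The 16 GB figure can be checked with a short calculation. This is a minimal sketch that assumes the default WAL segment size of 16 MB (the segment size can be changed when PostgreSQL is built); the function name wal_retention_bytes is hypothetical.

```python
# Assumption: default WAL segment size of 16 MB (16 * 1024 * 1024 bytes).
WAL_SEGMENT_BYTES = 16 * 1024 * 1024


def wal_retention_bytes(wal_keep_segments):
    """Amount of WAL kept in pg_xlog for a given wal_keep_segments value."""
    return wal_keep_segments * WAL_SEGMENT_BYTES


# Size retained with wal_keep_segments = 1024, expressed in GiB.
retained_gib = wal_retention_bytes(1024) / 1024 ** 3
print(retained_gib)
```

With the default of zero, no extra segments are kept for the standby, which is why the earlier setup had no margin at all.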

But the error came back yesterday, with the same fatal corruption
symptoms. It seems to be correlated with the amount of data being
replicated: this time, it happened right after a pg_restore (the dumps
in custom format are around 50 MB).

The bandwidth between the servers is quite sufficient: I observed up to
70 MB/s with rsync.

Promotion is done with Debian's pg_ctlcluster promote, which I believe
to be, like other Debian tools, a wrapper that selects the right cluster.
The application software is started after the promotion.

Any hints appreciated, thanks!

Precise version: 9.1.4-2~bpo60+1 from Debian squeeze-backports

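Log extract (French locale; the English equivalent of each message is
given in brackets):
2012-07-22 21:27:59 UTC LOG: restauration terminée de l'archive
  [archive recovery complete]
2012-07-22 21:27:59 UTC LOG: le système de bases de données est prêt pour accepter les connexions
  [database system is ready to accept connections]
2012-07-22 21:27:59 UTC LOG: lancement du processus autovacuum
  [autovacuum launcher started]
2012-07-22 21:30:19 UTC ERREUR: n'a pas pu lire le bloc 0 du fichier « base/142824/151268 » : a lu seulement 0 octets sur 8192
  [ERROR: could not read block 0 in file "base/142824/151268": read only 0 of 8192 bytes]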

--
Georges Racinet
Anybox SAS, http://anybox.fr
Bureau: 09 53 53 72 97 Portable: 06 51 32 07 27
GPG: 0x33AB0A35, sur serveurs publics
