Re: BUG #7500: hot-standby replica crash after an initial rsync

From: Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #7500: hot-standby replica crash after an initial rsync
Date: 2012-08-22 10:08:25
Message-ID: CAK-MWwQdxsE3fvi2HzHeiAr9sGNp3ESx=B6gD=TXqZW6e2mf8A@mail.gmail.com
Lists: pgsql-bugs

> > Does that mean that file was damaged during rsync?
> Not necessarily. When did you initially set up that cluster? Normally the
> file should get zeroed out before it is used. If the cluster was copied
> improperly (i.e. no pg_start/stop backup or such) it could easily happen.
> But I would guess that you did it properly?
>

That server was the master before a planned switchover to the replica server
was performed.
Procedure (steps 4-6 are sketched as commands below):
1) stop former master
2) trigger replica failover (thus that server became the new master)

There is an unusual part here, caused by human error on the part of the system
administrator:
2.1) start former master
2.2) stop former master

3) empty pg_xlog directory on former master
4) pg_start_backup on new master
5) re-sync datafiles to former master using:
rsync -aq --delete --exclude pg_xlog /db/data [former master]:/db/
6) pg_stop_backup on new master
7) start former master as replica...
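
In shell terms, steps 4-6 look roughly like this (connection options omitted,
and the backup label is just an example):

# step 4: on the new master, enter backup mode
psql -c "SELECT pg_start_backup('resync_former_master');"
# step 5: copy the data files to the former master, skipping WAL
rsync -aq --delete --exclude pg_xlog /db/data [former master]:/db/
# step 6: on the new master, leave backup mode
psql -c "SELECT pg_stop_backup();"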

That procedure has been used more than 10 times in the past without any
problem. Maybe I am missing something in it.

>
> Could you show pg_controldata output for primary/replica? To see how new
> that multixact offset is.
>

On failing replica:
[pgsql(at)db1 ~/tmp/postgresql-9.0.8]$ pg_controldata /db/data_back
pg_control version number: 903
Catalog version number: 201008051
Database system identifier: 5687760461964967531
Database cluster state: in archive recovery
pg_control last modified: Wed Aug 22 13:47:35 2012
Latest checkpoint location: 925/2D41CAA0
Prior checkpoint location: 925/2D41CAA0
Latest checkpoint's REDO location: 925/2C0004C8
Latest checkpoint's TimeLineID: 5
Latest checkpoint's NextXID: 0/1768511856
Latest checkpoint's NextOID: 59880938
Latest checkpoint's NextMultiXactId: 361788644
Latest checkpoint's NextMultiOffset: 822580965
Latest checkpoint's oldestXID: 1569182525
Latest checkpoint's oldestXID's DB: 16424
Latest checkpoint's oldestActiveXID: 1768498919
Time of latest checkpoint: Tue Aug 21 08:21:42 2012
Minimum recovery ending location: 925/2C0004C8
Backup start location: 925/2C0004C8
Current wal_level setting: hot_standby
Current max_connections setting: 300
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 2048
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value

There is no sense in looking at pg_controldata on the master, because the
master is already 1 day ahead of that replica. It's a production system.

>
> > What next I should examine?
> If you could give me a DEBUG3 log from startup till it crashes it would be
> helpful. Please #define MULTIXACT_DEBUG on top of multixact.c before doing
> so. That's probably going to be too big to go through the mailing list.
>

I see nothing new after I rebuilt postgres with that define and ran it as
/usr/local/pgsql/tmp/postgresql/bin/postgres -d 3 -D /db/data_back/.
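
(For reference, the rebuild was roughly: add "#define MULTIXACT_DEBUG" near the
top of src/backend/access/transam/multixact.c in the ~/tmp/postgresql-9.0.8
source tree, then

./configure --prefix=/usr/local/pgsql/tmp/postgresql
make
make install

with the prefix matching the binary path above; exact configure options
omitted.)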
All I get is:

LOCATION: readRecoveryCommandFile, xlog.c:5269
DEBUG: 00000: primary_conninfo = 'host=10.12.22.19 port=5433 user=replica
password=acilper'
LOCATION: readRecoveryCommandFile, xlog.c:5276
DEBUG: 00000: trigger_file = '/db/data/failover'
LOCATION: readRecoveryCommandFile, xlog.c:5283
LOG: 00000: entering standby mode
LOCATION: StartupXLOG, xlog.c:5827
DEBUG: 00000: checkpoint record is at 925/2D41CAA0
LOCATION: StartupXLOG, xlog.c:5894
DEBUG: 00000: redo record is at 925/2C0004C8; shutdown FALSE
LOCATION: StartupXLOG, xlog.c:5929
DEBUG: 00000: next transaction ID: 0/1768511856; next OID: 59880938
LOCATION: StartupXLOG, xlog.c:5933
DEBUG: 00000: next MultiXactId: 361788644; next MultiXactOffset: 822580965
LOCATION: StartupXLOG, xlog.c:5936
DEBUG: 00000: oldest unfrozen transaction ID: 1569182525, in database 16424
LOCATION: StartupXLOG, xlog.c:5939
DEBUG: 00000: MultiXact: setting next multi to 361788644 offset 822580965
LOCATION: MultiXactSetNextMXact, multixact.c:1668
DEBUG: 00000: transaction ID wrap limit is 3716666172, limited by database
with OID 16424
LOCATION: SetTransactionIdLimit, varsup.c:334
WARNING: 01000: pid of startup is: 16919, sleeping for 60s
LOCATION: StartupXLOG, xlog.c:5983
DEBUG: 00000: initializing for hot standby
LOCATION: StartupXLOG, xlog.c:6073
LOG: 00000: redo starts at 925/2C0004C8
LOCATION: StartupXLOG, xlog.c:6202
FATAL: XX000: could not access status of transaction 361788644
DETAIL: Could not read from file "pg_multixact/offsets/1590" at offset
114688: Unknown error: 0.
CONTEXT: xlog redo create multixact 361788644 offset 822580965: 1768508061
1768508078
LOCATION: SlruReportIOError, slru.c:840
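
The failing read is page 14 of that segment file (114688 / 8192 = 14), so a
quick check of the file itself (same data directory as in the pg_controldata
call above) could be:

# the file must be at least (14 + 1) * 8192 = 122880 bytes for that read to succeed
ls -l /db/data_back/pg_multixact/offsets/1590
# dump the 8 kB page at offset 114688; a short read would mean the file is
# truncated, all zeroes would mean the page exists but was never written
dd if=/db/data_back/pg_multixact/offsets/1590 bs=8192 skip=14 count=1 2>/dev/null | od -A d -t x4 | head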

>
> If you're feeling adventurous you could use the xlogdump project
> (https://github.com/snaga/xlogdump) to see what xlog records happen from
> the time it starts up till it crashes. I don't immediately see a way a
> XLOG_MULTIXACT_CREATE_ID can be generated without a preceding
> XLOG_MULTIXACT_ZERO_OFF_PAGE zeroing the page.
> But most of that should already be visible in the log with the above
> define.
>
> Did you have any crashes, -m immediate shutdowns, or similar on the
> primary or replica?
>

I don't think so. But I'm not sure, because the task was performed by the
system administrator following the provided instructions, and I was informed
only after the replica failed to start.

Kind Regards,
Maksym
