Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves...

From: Sean Chittenden <sean(at)chittenden(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves...
Date: 2012-10-11 03:03:26
Message-ID: 57B4F1DB-D8E1-4195-86C4-F66E61915B84@chittenden.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

>> Oct 5 15:00:25 db01 postgres[76648]: [5944-1] javafail(at)dbcluster 76648 0: FATAL: pipe() failed: Too many open files in system
>
> This message must be coming from initSelfPipe(), and after poking around
> a bit I think the failure must be occurring while a new backend is
> attempting to do "OwnLatch(&MyProc->procLatch)" in InitProcess. The
> reason the postmaster treats this as a crash is that the new backend
> just armed the dead-man switch (MarkPostmasterChildActive) but it exits
> without doing ProcKill which would disarm it. So this is just an
> order-of-operations bug in InitProcess: we're assuming that it can't
> fail before reaching "on_shmem_exit(ProcKill, 0)", and the latch
> additions broke that. (Though looking at it, assuming that the
> PGSemaphoreReset call cannot fail seems a tad risky too.)
>
> So that explains the crashes, but it doesn't (directly) explain why you
> had data corruption.

I've since been able to repeat this with full_page_writes=on and have had identical corruption, so I don't think this is full_page_writes related any more. I can also confirm that there was just one crash by the master database because it required manual intervention to bring back up (a backup label was sitting stale in the data dir[1]).

The slaves died during the post-crash VACUUM just as earlier. I'm completing the VACUUM now and am seeing multiple warnings from VACUUM.

WARNING: relation "tbl_a" page 2115352 is uninitialized --- fixing
WARNING: relation "tbl_a" page 2115353 is uninitialized --- fixing
WARNING: relation "tbl_a" page 2115354 is uninitialized --- fixing
WARNING: relation "tbl_a" page 2115355 is uninitialized --- fixing
WARNING: relation "tbl_a" page 2115356 is uninitialized --- fixing
WARNING: relation "tbl_a" page 2115357 is uninitialized --- fixing
WARNING: relation "tbl_a" page 2115358 is uninitialized --- fixing
WARNING: relation "tbl_a" page 2115359 is uninitialized --- fixing
WARNING: relation "tbl_a" page 2115360 is uninitialized --- fixing

On one of the slaves:

Oct 11 00:17:36 db02 postgres[66904]: [21-1] @ 66904 0: WARNING: page 120547 of relation base/16387/20196 is uninitialized
Oct 11 00:17:36 db02 postgres[66904]: [21-2] @ 66904 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 125016, lastBlockVacuumed 0
Oct 11 00:17:36 db02 postgres[66904]: [22-1] @ 66904 0: PANIC: WAL contains references to invalid pages
Oct 11 00:17:36 db02 postgres[66904]: [22-2] @ 66904 0: CONTEXT: xlog redo vacuum: rel 1663/16387/20196; blk 125016, lastBlockVacuumed 0
Oct 11 00:17:36 db02 postgres[66897]: [10-1] @ 66897 0: LOG: startup process (PID 66904) was terminated by signal 6: Abort trap
Oct 11 00:17:36 db02 postgres[66897]: [11-1] @ 66897 0: LOG: terminating any other active server processes

Hopefully this is helpful information. -sc

[1] The backup label was residual from the initial sync and wasn't cleaned up for some reason during a pg_stop_backup(). For now I'm chalking this up as a bug in repmgr even though repmgr completed cloning the slave successfully (supposedly).

START WAL LOCATION: 9F/36000020 (file 000000010000009F00000036)
CHECKPOINT LOCATION: 9F/37E9D6D8
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2012-10-06 17:48:10 UTC
LABEL: repmgr_standby_clone_1349545601

--
Sean Chittenden
sean(at)chittenden(dot)org

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Joshua D. Drake 2012-10-11 03:17:07 Re: moving from MySQL to pgsql
Previous Message Tom Lane 2012-10-11 02:09:53 Re: Can two “SELECT FOR UPDATE” statements on the same table cause a deadlock?