Re: Standby stopped working after PANIC: WAL contains references to invalid pages

From: Dan Kogan <dan(at)iqtell(dot)com>
To: Lonni J Friedman <netllama(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Standby stopped working after PANIC: WAL contains references to invalid pages
Date: 2013-06-24 13:44:09
Message-ID: 60B572D9298D944580F7D51195DD3080468902FFE8@VMBX125.ihostexchange.net
Lists: pgsql-general

We have backed up $PGDATA, but had to re-initialize the slave.
We also have the WALs from the day this happened.

Thanks,
Dan

-----Original Message-----
From: Lonni J Friedman [mailto:netllama(at)gmail(dot)com]
Sent: Saturday, June 22, 2013 10:09 PM
To: Dan Kogan
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] Standby stopped working after PANIC: WAL contains references to invalid pages

Assuming that you still have $PGDATA from the broken instance (such that you can reproduce the crash again), there might be a way to debug it further. I'd guess that something like bad RAM or storage could cause an index to get corrupted in this fashion, but the fact that you're using AWS makes that less likely. Someone far more knowledgeable than I will need to provide guidance on how to debug this though.

On Sat, Jun 22, 2013 at 4:17 PM, Dan Kogan <dan(at)iqtell(dot)com> wrote:
> Re-seeding the standby with a full base backup does seem to make the error go away.
> The standby started, caught up and has been working for about 2 hours.
>
> The file in the error message was an index. We rebuilt it just in case.
> Is there any way to debug the issue at this point?
>
>
>
> -----Original Message-----
> From: Lonni J Friedman [mailto:netllama(at)gmail(dot)com]
> Sent: Saturday, June 22, 2013 4:11 PM
> To: Dan Kogan
> Cc: pgsql-general(at)postgresql(dot)org
> Subject: Re: [GENERAL] Standby stopped working after PANIC: WAL
> contains references to invalid pages
>
> Looks like some kind of data corruption. Question is whether it came from the master, or was created by the standby. If you re-seed the standby with a full (base) backup, does the problem go away?
>
> On Sat, Jun 22, 2013 at 12:43 PM, Dan Kogan <dan(at)iqtell(dot)com> wrote:
>> Hello,
>>
>>
>>
>> Today our standby instance stopped working with this error in the log:
>>
>>
>>
>> 2013-06-22 16:27:32 UTC [8367]: [247-1] [] WARNING: page 158130 of relation pg_tblspc/16447/PG_9.2_201204301/16448/39154429 is uninitialized
>> 2013-06-22 16:27:32 UTC [8367]: [248-1] [] CONTEXT: xlog redo vacuum: rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
>> 2013-06-22 16:27:32 UTC [8367]: [249-1] [] PANIC: WAL contains references to invalid pages
>> 2013-06-22 16:27:32 UTC [8367]: [250-1] [] CONTEXT: xlog redo vacuum: rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
>> 2013-06-22 16:27:32 UTC [8366]: [3-1] [] LOG: startup process (PID 8367) was terminated by signal 6: Aborted
>> 2013-06-22 16:27:32 UTC [8366]: [4-1] [] LOG: terminating any other active server processes
>>
>>
>>
>> After a restart, the exact same error occurred.
>>
>>
>>
>> We thought that maybe we hit this bug -
>> http://postgresql.1045698.n5.nabble.com/Completely-broken-replica-after-PANIC-WAL-contains-references-to-invalid-pages-td5750072.html.
>>
>> However, there is nothing in our log about sub-transactions, so it
>> didn't seem the same to us.
>>
>>
>>
>> Any advice on how to further debug this so we can avoid this in the
>> future is appreciated.
>>
>>
>>
>> Environment:
>>
>>
>>
>> AWS, High I/O instance (hi1.4xlarge), 60GB RAM
>>
>>
>>
>> Software and settings:
>>
>>
>>
>> PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2, 64-bit
>>
>>
>>
>> archive_command rsync -a %p slave:/var/lib/postgresql/replication_load/%f
>>
>> archive_mode on
>>
>> autovacuum_freeze_max_age 1000000000
>>
>> autovacuum_max_workers 6
>>
>> checkpoint_completion_target 0.9
>>
>> checkpoint_segments 128
>>
>> checkpoint_timeout 30min
>>
>> default_text_search_config pg_catalog.english
>>
>> hot_standby on
>>
>> lc_messages en_US.UTF-8
>>
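
For anyone inspecting the saved $PGDATA copy, the page named in the WARNING can be located on disk with a little arithmetic. The sketch below is only an illustration: it assumes the default 8 kB block size and 1 GB relation segment size (neither is stated in the thread), and the function names split_tablespace_path and locate_block are made up for this example.

```python
# Sketch: map a relation path and block number from the log to a segment
# file and byte offset. Assumes default build settings (8 kB blocks, 1 GB
# segments); a server built with other sizes needs different values.

BLOCK_SIZE = 8192          # assumed default BLCKSZ
SEGMENT_SIZE = 1024 ** 3   # assumed default 1 GB relation segment


def split_tablespace_path(rel_path):
    """Split 'pg_tblspc/<spc>/<version dir>/<db>/<relfilenode>' into parts."""
    parts = rel_path.split("/")
    if len(parts) != 5 or parts[0] != "pg_tblspc":
        raise ValueError("not a pg_tblspc relation path: %r" % rel_path)
    return {
        "tablespace_oid": int(parts[1]),
        "version_dir": parts[2],
        "database_oid": int(parts[3]),
        "relfilenode": int(parts[4]),
    }


def locate_block(rel_path, block_no,
                 block_size=BLOCK_SIZE, segment_size=SEGMENT_SIZE):
    """Return (segment file path, byte offset of the block in that file)."""
    blocks_per_segment = segment_size // block_size
    segment, block_in_segment = divmod(block_no, blocks_per_segment)
    # Segment 0 has no suffix; later segments are named <relfilenode>.<n>
    seg_path = rel_path if segment == 0 else "%s.%d" % (rel_path, segment)
    return seg_path, block_in_segment * block_size


if __name__ == "__main__":
    path = "pg_tblspc/16447/PG_9.2_201204301/16448/39154429"
    print(split_tablespace_path(path))
    print(locate_block(path, 158130))
```

The relfilenode can then be matched to a relation name by connecting to the database whose OID is 16448 and querying pg_class, for example: SELECT relname, relkind FROM pg_class WHERE relfilenode = 39154429;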
