Re: Standby stopped working after PANIC: WAL contains references to invalid pages

From: Dan Kogan <dan(at)iqtell(dot)com>
To: Lonni J Friedman <netllama(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Standby stopped working after PANIC: WAL contains references to invalid pages
Date: 2013-06-22 23:17:11
Message-ID: 60B572D9298D944580F7D51195DD3080468902FFD1@VMBX125.ihostexchange.net
Lists: pgsql-general

Re-seeding the standby with a full base backup does seem to make the error go away.
The standby started, caught up and has been working for about 2 hours.

The file in the error message was an index. We rebuilt it just in case.
Is there any way to debug the issue at this point?
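
For anyone who wants to inspect the affected page on disk, here is a minimal sketch of the arithmetic that maps a block number from the log to a segment file and a byte offset. It assumes the block size (8192) and blocks per segment (131072) reported in the pg_controldata output quoted below, and the usual on-disk naming where the first segment is the bare relfilenode and later segments get a numeric suffix. The function name locate_block is made up for illustration and is not part of any PostgreSQL tool.

```python
# Sketch only: values below come from the quoted pg_controldata output,
# and locate_block is a hypothetical helper, not a PostgreSQL utility.
BLOCK_SIZE = 8192            # "Database block size"
BLOCKS_PER_SEGMENT = 131072  # "Blocks per segment of large relation"


def locate_block(relfilenode, block,
                 block_size=BLOCK_SIZE,
                 blocks_per_segment=BLOCKS_PER_SEGMENT):
    """Return (segment file name, byte offset within that file) for a block."""
    segment = block // blocks_per_segment
    block_in_segment = block % blocks_per_segment
    # Segment 0 is the bare relfilenode; later segments are "<relfilenode>.<n>".
    filename = str(relfilenode) if segment == 0 else f"{relfilenode}.{segment}"
    return filename, block_in_segment * block_size


if __name__ == "__main__":
    # Block numbers taken from the WARNING and CONTEXT lines in the log.
    for blk in (158130, 158134):
        name, offset = locate_block(39154429, blk)
        print(f"block {blk}: file {name}, byte offset {offset}")
```

With the numbers from the log, both blocks fall in the second segment file under pg_tblspc/16447/PG_9.2_201204301/16448/, so that is the file to compare between master and standby.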

-----Original Message-----
From: Lonni J Friedman [mailto:netllama(at)gmail(dot)com]
Sent: Saturday, June 22, 2013 4:11 PM
To: Dan Kogan
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] Standby stopped working after PANIC: WAL contains references to invalid pages

Looks like some kind of data corruption. Question is whether it came from the master, or was created by the standby. If you re-seed the standby with a full (base) backup, does the problem go away?

On Sat, Jun 22, 2013 at 12:43 PM, Dan Kogan <dan(at)iqtell(dot)com> wrote:
> Hello,
>
>
>
> Today our standby instance stopped working with this error in the log:
>
>
>
> 2013-06-22 16:27:32 UTC [8367]: [247-1] [] WARNING: page 158130 of relation pg_tblspc/16447/PG_9.2_201204301/16448/39154429 is uninitialized
>
> 2013-06-22 16:27:32 UTC [8367]: [248-1] [] CONTEXT: xlog redo vacuum: rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
>
> 2013-06-22 16:27:32 UTC [8367]: [249-1] [] PANIC: WAL contains references to invalid pages
>
> 2013-06-22 16:27:32 UTC [8367]: [250-1] [] CONTEXT: xlog redo vacuum: rel 16447/16448/39154429; blk 158134, lastBlockVacuumed 158129
>
> 2013-06-22 16:27:32 UTC [8366]: [3-1] [] LOG: startup process (PID 8367) was terminated by signal 6: Aborted
>
> 2013-06-22 16:27:32 UTC [8366]: [4-1] [] LOG: terminating any other active server processes
>
>
>
> After re-start the same exact error occurred.
>
>
>
> We thought that maybe we hit this bug -
> http://postgresql.1045698.n5.nabble.com/Completely-broken-replica-after-PANIC-WAL-contains-references-to-invalid-pages-td5750072.html.
>
> However, there is nothing in our log about sub-transactions, so it
> didn't seem the same to us.
>
>
>
> Any advice on how to further debug this so we can avoid this in the
> future is appreciated.
>
>
>
> Environment:
>
>
>
> AWS, High I/O instance (hi1.4xlarge), 60GB RAM
>
>
>
> Software and settings:
>
>
>
> PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2, 64-bit
>
>
>
> archive_command rsync -a %p slave:/var/lib/postgresql/replication_load/%f
>
> archive_mode on
>
> autovacuum_freeze_max_age 1000000000
>
> autovacuum_max_workers 6
>
> checkpoint_completion_target 0.9
>
> checkpoint_segments 128
>
> checkpoint_timeout 30min
>
> default_text_search_config pg_catalog.english
>
> hot_standby on
>
> lc_messages en_US.UTF-8
>
> lc_monetary en_US.UTF-8
>
> lc_numeric en_US.UTF-8
>
> lc_time en_US.UTF-8
>
> listen_addresses *
>
> log_checkpoints on
>
> log_destination stderr
>
> log_line_prefix %t [%p]: [%l-1] [%h]
>
> log_min_duration_statement -1
>
> log_min_error_statement error
>
> log_min_messages error
>
> log_timezone UTC
>
> maintenance_work_mem 1GB
>
> max_connections 1200
>
> max_standby_streaming_delay 90s
>
> max_wal_senders 5
>
> port 5432
>
> random_page_cost 2
>
> seq_page_cost 1
>
> shared_buffers 4GB
>
> ssl off
>
> ssl_cert_file /etc/ssl/certs/ssl-cert-snakeoil.pem
>
> ssl_key_file /etc/ssl/private/ssl-cert-snakeoil.key
>
> synchronous_commit off
>
> TimeZone UTC
>
> wal_keep_segments 128
>
> wal_level hot_standby
>
> work_mem 8MB
>
>
>
> root(at)ip-10-148-131-236:~# /usr/local/pgsql/bin/pg_controldata /usr/local/pgsql/data
>
> pg_control version number: 922
>
> Catalog version number: 201204301
>
> Database system identifier: 5838668587531239413
>
> Database cluster state: in archive recovery
>
> pg_control last modified: Sat 22 Jun 2013 06:13:07 PM UTC
>
> Latest checkpoint location: 2250/18CA0790
>
> Prior checkpoint location: 2250/18CA0790
>
> Latest checkpoint's REDO location: 224F/E127B078
>
> Latest checkpoint's TimeLineID: 2
>
> Latest checkpoint's full_page_writes: on
>
> Latest checkpoint's NextXID: 1/2018629527
>
> Latest checkpoint's NextOID: 43086248
>
> Latest checkpoint's NextMultiXactId: 7088726
>
> Latest checkpoint's NextMultiOffset: 20617234
>
> Latest checkpoint's oldestXID: 1690316999
>
> Latest checkpoint's oldestXID's DB: 16448
>
> Latest checkpoint's oldestActiveXID: 2018629527
>
> Time of latest checkpoint: Sat 22 Jun 2013 03:24:05 PM UTC
>
> Minimum recovery ending location: 2251/5EA631F0
>
> Backup start location: 0/0
>
> Backup end location: 0/0
>
> End-of-backup record required: no
>
> Current wal_level setting: hot_standby
>
> Current max_connections setting: 1200
>
> Current max_prepared_xacts setting: 0
>
> Current max_locks_per_xact setting: 64
>
> Maximum data alignment: 8
>
> Database block size: 8192
>
> Blocks per segment of large relation: 131072
>
> WAL block size: 8192
>
> Bytes per WAL segment: 16777216
>
> Maximum length of identifiers: 64
>
> Maximum columns in an index: 32
>
> Maximum size of a TOAST chunk: 1996
>
> Date/time type storage: 64-bit integers
>
> Float4 argument passing: by value
>
> Float8 argument passing: by value
>
> root(at)ip-10-148-131-236:~#
>
>
>
> Thanks again.
>
>
>
> Dan

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L. Friedman netllama(at)gmail(dot)com
LlamaLand https://netllama.linux-sxs.org
