Re: BUG #13822: Slave terminated - WAL contains references to invalid page

From: <Marek(dot)Petr(at)tieto(dot)com>
To: <michael(dot)paquier(at)gmail(dot)com>
Cc: <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13822: Slave terminated - WAL contains references to invalid page
Date: 2015-12-28 11:50:07
Message-ID: f90c2c290bdd4629b3684ca351ab3261@C105S135VM024.eu.tieto.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Tried to use pageinspect module for affected pages from last two occurences:

2015-12-15 13:05:39 CET @ WARNING: page 4333275 of relation base/16422/17230 is uninitialized
2015-12-22 00:25:11 CET @ WARNING: page 71566 of relation base/16422/23253 is uninitialized

Following outputs are the same for master and slave:

# select relname from pg_class where relfilenode in ('17230','23253');
relname
----------------
pg_toast_17225
pg_toast_23246
(2 rows)

# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_23246', 71566));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
--------------+----------+-------+-------+-------+---------+----------+---------+-----------
181/1BF593A8 | 0 | 4 | 44 | 600 | 8192 | 8192 | 4 | 0
(1 row)

# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_17225', 4333275));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
--------------+----------+-------+-------+-------+---------+----------+---------+-----------
172/30BECB58 | 0 | 4 | 40 | 64 | 8192 | 8192 | 4 | 0
(1 row)

# SELECT * FROM heap_page_items(get_raw_page('pg_toast.pg_toast_23246', 71566));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
----+--------+----------+--------+-----------+--------+----------+-----------+-------------+------------+--------+--------+-------
1 | 6160 | 1 | 2032 | 364894821 | 0 | 1 | (71566,1) | 3 | 2306 | 24 | |
2 | 5160 | 1 | 994 | 364894821 | 0 | 1 | (71566,2) | 3 | 2306 | 24 | |
3 | 3128 | 1 | 2032 | 364894934 | 0 | 1 | (71566,3) | 3 | 2306 | 24 | |
4 | 2632 | 1 | 496 | 364894934 | 0 | 1 | (71566,4) | 3 | 2306 | 24 | |
5 | 600 | 1 | 2032 | 364895241 | 0 | 1 | (71566,5) | 3 | 2306 | 24 | |
(5 rows)

# SELECT * FROM heap_page_items(get_raw_page('pg_toast.pg_toast_17225', 4333275));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
----+--------+----------+--------+-----------+--------+----------+-------------+-------------+------------+--------+--------+-------
1 | 6160 | 1 | 2032 | 360224238 | 0 | 1 | (4333275,1) | 3 | 2306 | 24 | |
2 | 4128 | 1 | 2032 | 360224238 | 0 | 1 | (4333275,2) | 3 | 2306 | 24 | |
3 | 2096 | 1 | 2032 | 360224238 | 0 | 1 | (4333275,3) | 3 | 2306 | 24 | |
4 | 64 | 1 | 2032 | 360224238 | 0 | 1 | (4333275,4) | 3 | 2306 | 24 | |
(4 rows)

Non-default pars:

listen_addresses = '*'
max_connections = 600
ssl = on
ssl_cert_file = 'xxx.crt'
ssl_key_file = 'xxx.key'
ssl_ca_file = 'xxx.crt'
shared_buffers = 4GB
work_mem = 6990kB
maintenance_work_mem = 819MB
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all
wal_level = hot_standby
checkpoint_segments = 32
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'rsync -a %p xxx(at)xxx:/data/wal_arch/%f'
max_wal_senders = 5
wal_keep_segments = 64
hot_standby = on
effective_cache_size = 12GB

Slave rebuilded and it's running almost a week for now.

Regards
Marek

-----Original Message-----
From: Michael Paquier [mailto:michael(dot)paquier(at)gmail(dot)com]
Sent: Saturday, December 26, 2015 2:15 PM
To: Petr Marek <Marek(dot)Petr(at)tieto(dot)com>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #13822: Slave terminated - WAL contains references to invalid page

On Tue, Dec 22, 2015 at 9:05 PM, <Marek(dot)Petr(at)tieto(dot)com> wrote:
> 2015-12-22 00:25:11 CET @ WARNING: page 71566 of relation
> base/16422/23253 is uninitialized
> 2015-12-22 00:25:11 CET @ CONTEXT: xlog redo visible: rel
> 1663/16422/23253; blk 71566
> 2015-12-22 00:25:11 CET @ PANIC: WAL contains references to invalid
> pages
> 2015-12-22 00:25:11 CET @ CONTEXT: xlog redo visible: rel
> 1663/16422/23253; blk 71566
> 2015-12-22 00:25:12 CET @ LOG: startup process (PID 24434) was
> terminated by signal 6: Aborted
> 2015-12-22 00:25:12 CET @ LOG: terminating any other active server
> processes

Looking more closely at that, this is the code path of the redo routine for XLOG_HEAP2_VISIBLE. I have been looking at the area of the code around visibilitymap_set to try to see if there could be a race condition with another backend extending the relation and causing the page to be uninitialized but have not found anything yet. 9.4 has been out for some time, and this is the first report of this kind for this redo routine. Still, you have been able to reproduce the problem twice, so this has the smell of a bug... Others, opinions?

Did you rebuild a new slave and let the master running, and perhaps some data corruption is coming from it? What's the state of the same pages on the master? Are they zero'ed?

Also, are you using any parameter with a value different than the default. I don't know fsync, full_page_writes...

> select relname from pg_class where relfilenode in ('17230','23253');
> relname
> ----------------
> pg_toast_17225
> pg_toast_23246
> (2 rows)
>
> First toast's relation has 34GB, second 2452 MB.
> Is it possible to get more info from some deeper logging for the case it will occur again?

I am not sure to understand what you are looking for here. You could make the logs more verbose but this would bloat your log partition...
--
Michael

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Marko Tiikkaja 2015-12-28 14:50:16 Re: BUG #13833: Postgres ORDER BY value inside json causes “column does not exist” error
Previous Message alain.laporte12345 2015-12-28 11:03:50 BUG #13836: Documentation not consistent about pg_commit_ts directory