From: Vladimir Borodin <root(at)simply(dot)name>
To: pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: Hanging startup process on the replica after vacuuming on master
Date: 2015-03-25 16:32:01
Message-ID: FE82A9A7-0D52-41B5-A9ED-967F6927CB8A@simply.name
Lists: pgsql-admin

Hi all.

We have already described [0] our problem on pgsql-general@ and written to pgsql-performance@, but the problem was described too confusingly there, so let me ask for help one more time in a clearer way.

We have a database where one table contains 3+ billion records. The definition of the table is quite simple:

rpopdb04d/rpopdb M # \dS+ rpop.rpop_uidl
                               Table "rpop.rpop_uidl"
 Column |          Type          | Modifiers | Storage  | Stats target | Description
--------+------------------------+-----------+----------+--------------+-------------
 popid  | bigint                 | not null  | plain    |              |
 uidl   | character varying(200) | not null  | extended |              |
Indexes:
    "pk_rpop_uidl" PRIMARY KEY, btree (popid, uidl)

rpopdb04d/rpopdb M #
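
For readability, the equivalent DDL (reconstructed from the \dS+ output above, nothing beyond it) is roughly:

CREATE TABLE rpop.rpop_uidl (
    popid bigint                 NOT NULL,
    uidl  character varying(200) NOT NULL,
    CONSTRAINT pk_rpop_uidl PRIMARY KEY (popid, uidl)
);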

When vacuuming of this table finishes on the master, we see replay_location on all replicas hang at a specific WAL position for up to 15 minutes. The other positions (sent_location, write_location, flush_location) in the pg_stat_replication view continue increasing as expected.

rpopdb04d/postgres M # select client_hostname, sent_location, write_location, flush_location, replay_location from pg_stat_replication; \watch 1
<...>
Watch every 1s Fri Mar 20 14:15:16 2015

client_hostname | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
rpopdb04g.domain.com | 413F/CD8223D8 | 413F/CD81EDF8 | 413F/CD81EDF8 | 413F/CD81D4C0
rpopdb04e.domain.com | 413F/CD8223D8 | 413F/CD81D4C0 | 413F/CD81D4C0 | 413F/CD81D3E0
(2 rows)

Watch every 1s Fri Mar 20 14:15:17 2015

client_hostname | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
rpopdb04g.domain.com | 413F/CD9887C0 | 413F/CD9884C0 | 413F/CD9881F0 | 413F/CD988058
rpopdb04e.domain.com | 413F/CD9887C0 | 413F/CD988570 | 413F/CD9884C0 | 413F/CD9884C0
(2 rows)

Watch every 1s Fri Mar 20 14:15:18 2015

client_hostname | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
rpopdb04g.domain.com | 413F/CDA89F38 | 413F/CDA89F38 | 413F/CDA89EA0 | 413F/CDA131C8
rpopdb04e.domain.com | 413F/CDA89F38 | 413F/CDA89EA0 | 413F/CDA89EA0 | 413F/CDA131C8
(2 rows)

Watch every 1s Fri Mar 20 14:15:19 2015

client_hostname | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
rpopdb04g.domain.com | 413F/CDB8DA20 | 413F/CDB8D7C0 | 413F/CDB8D630 | 413F/CDA131C8
rpopdb04e.domain.com | 413F/CDB8DA20 | 413F/CDB8D630 | 413F/CDB8D630 | 413F/CDA131C8

<...>
<9 minutes of stuck replay_location>
<...>
Watch every 1s Fri Mar 20 14:24:11 2015

client_hostname | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
rpopdb04g.domain.com | 4140/1F7E28C8 | 4140/1F7E2838 | 4140/1F7E2838 | 413F/CDA131C8
rpopdb04e.domain.com | 4140/1F7E28C8 | 4140/1F7E2550 | 4140/1F7E2550 | 413F/CDA131C8
(2 rows)

Watch every 1s Fri Mar 20 14:24:12 2015

client_hostname | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
rpopdb04g.domain.com | 4140/1F9F6A28 | 4140/1F9F6508 | 4140/1F9F6508 | 413F/CDA131C8
rpopdb04e.domain.com | 4140/1F9F6A28 | 4140/1F9F6508 | 4140/1F9F6508 | 413F/CE258D48
(2 rows)

Watch every 1s Fri Mar 20 14:24:13 2015

client_hostname | sent_location | write_location | flush_location | replay_location
---------------------------+---------------+----------------+----------------+-----------------
rpopdb04g.domain.com | 4140/203BA108 | 4140/203BA108 | 4140/203BA108 | 413F/CDA131C8
rpopdb04e.domain.com | 4140/203BA108 | 4140/203BA108 | 4140/203BA108 | 413F/CE853C38
(2 rows)
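
For reference, the same lag can be expressed in bytes directly on the master with a small helper query along these lines (pg_xlog_location_diff is available in 9.4):

-- replay lag in bytes per standby, as seen from the master
select client_hostname,
       pg_xlog_location_diff(sent_location, replay_location) as replay_lag_bytes
  from pg_stat_replication;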

The content of this WAL position according to pg_xlogdump is the following (actually this is output from a different playback of the problem, so the WAL positions differ, but the essence is the same):
rmgr: Btree len (rec/tot): 20/ 52, tx: 0, lsn: 4115/56126DC0, prev 4115/56126D90, bkp: 0000, desc: vacuum: rel 1663/16420/16796; blk 31222118, lastBlockVacuumed 0
This OID (16420/16796) belongs to the primary key of this table.
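
(We mapped the filenode from the record back to the relation with a lookup of this kind, run in the rpopdb database:)

-- 16796 is the relfilenode from the pg_xlogdump record above
select relname from pg_class where relfilenode = 16796;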

We have also noticed that while replay_location is stuck, the startup process on the replica consumes a lot of I/O, reading lots of data from the array holding $PGDATA (we keep xlogs on a separate array). Most backtraces of the startup process during this time look like this:

0x00007f54a71444c0 in __read_nocancel () from /lib64/libc.so.6
#0 0x00007f54a71444c0 in __read_nocancel () from /lib64/libc.so.6
#1 0x000000000065d2f5 in FileRead (file=<value optimized out>, buffer=0x7f53ac0dba20 ";9", amount=8192) at fd.c:1286
#2 0x000000000067acad in mdread (reln=<value optimized out>, forknum=<value optimized out>, blocknum=12063036, buffer=0x7f53ac0dba20 ";9") at md.c:679
#3 0x0000000000659b4e in ReadBuffer_common (smgr=<value optimized out>, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=12063036, mode=RBM_NORMAL_NO_LOG, strategy=0x0, hit=0x7fff898a912f "") at bufmgr.c:476
#4 0x000000000065a61b in ReadBufferWithoutRelcache (rnode=..., forkNum=MAIN_FORKNUM, blockNum=12063036, mode=<value optimized out>, strategy=<value optimized out>) at bufmgr.c:287
#5 0x00000000004cfb78 in XLogReadBufferExtended (rnode=..., forknum=MAIN_FORKNUM, blkno=12063036, mode=RBM_NORMAL_NO_LOG) at xlogutils.c:324
#6 0x00000000004a3651 in btree_xlog_vacuum (lsn=71744181579864, record=0x1e49dc0) at nbtxlog.c:522
#7 btree_redo (lsn=71744181579864, record=0x1e49dc0) at nbtxlog.c:1144
#8 0x00000000004c903a in StartupXLOG () at xlog.c:6827
#9 0x000000000062f8bf in StartupProcessMain () at startup.c:224
#10 0x00000000004d3e9a in AuxiliaryProcessMain (argc=2, argv=0x7fff898a98a0) at bootstrap.c:416
#11 0x000000000062a99c in StartChildProcess (type=StartupProcess) at postmaster.c:5146
#12 0x000000000062e9e2 in PostmasterMain (argc=3, argv=<value optimized out>) at postmaster.c:1237
#13 0x00000000005c7d68 in main (argc=3, argv=0x1e22910) at main.c:228

Investigating the source code of the btree_xlog_vacuum function, we found that it calls XLogReadBufferExtended with mode=RBM_NORMAL_NO_LOG in only one case:

    /*
     * If queries might be active then we need to ensure every leaf page is
     * unpinned between the lastBlockVacuumed and the current block, if there
     * are any. This prevents replay of the VACUUM from reaching the stage of
     * removing heap tuples while there could still be indexscans "in flight"
     * to those particular tuples (see nbtree/README).
     *
     * It might be worth checking if there are actually any backends running;
     * if not, we could just skip this.
     *
     * Since VACUUM can visit leaf pages out-of-order, it might issue records
     * with lastBlockVacuumed >= block; that's not an error, it just means
     * nothing to do now.
     *
     * Note: since we touch all pages in the range, we will lock non-leaf
     * pages, and also any empty (all-zero) pages that may be in the index. It
     * doesn't seem worth the complexity to avoid that. But it's important
     * that HotStandbyActiveInReplay() will not return true if the database
     * isn't yet consistent; so we need not fear reading still-corrupt blocks
     * here during crash recovery.
     */
    if (HotStandbyActiveInReplay())
    {
        BlockNumber blkno;

        for (blkno = xlrec->lastBlockVacuumed + 1; blkno < xlrec->block; blkno++)
        {
            /*
             * We use RBM_NORMAL_NO_LOG mode because it's not an error
             * condition to see all-zero pages. The original btvacuumpage
             * scan would have skipped over all-zero pages, noting them in FSM
             * but not bothering to initialize them just yet; so we mustn't
             * throw an error here. (We could skip acquiring the cleanup lock
             * if PageIsNew, but it's probably not worth the cycles to test.)
             *
             * XXX we don't actually need to read the block, we just need to
             * confirm it is unpinned. If we had a special call into the
             * buffer manager we could optimise this so that if the block is
             * not in shared_buffers we confirm it as unpinned.
             */
            buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno,
                                            RBM_NORMAL_NO_LOG);

And if I understand everything correctly, while replaying this WAL record the startup process reads 31222118 * 8 / 1024 / 1024 ~ 238 GB of data from disk (which seems to be the full size of the index).
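
(That 238 GB figure can be double-checked against the on-disk size of the PK with something like the following; 31222118 is the block number from the vacuum record and 8192 is the block size:)

select pg_size_pretty(31222118::bigint * 8192)                as blocks_as_bytes,
       pg_size_pretty(pg_relation_size('rpop.pk_rpop_uidl'))  as pk_on_disk;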

Below are some details about our environment, but my question is: what can we do to keep the replicas from lagging like this? Any help would be really appreciated.

All hosts (the master and two replicas) run RHEL 6.6 with PostgreSQL 9.4.1 (upgraded from 9.3.6, where we had the same problem). We use built-in streaming replication and the same configuration on all hosts:

rpopdb04g/postgres R # select name, setting from pg_settings where setting != boot_val and source = 'configuration file';
name | setting
---------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------
archive_command | rsync -a --contimeout=1 --timeout=5 --password-file=/etc/barman.passwd %p rsync://barman(at)pg-backup04h(dot)domain(dot)com:/barman/rpopdb04/incoming/%f
archive_mode | on
autovacuum_analyze_scale_factor | 0.001
autovacuum_max_workers | 10
autovacuum_naptime | 15
autovacuum_vacuum_cost_delay | 5
autovacuum_vacuum_cost_limit | 2000
autovacuum_vacuum_scale_factor | 1e-05
autovacuum_vacuum_threshold | 200
bgwriter_delay | 50
bgwriter_lru_maxpages | 250
checkpoint_completion_target | 0.8
checkpoint_segments | 384
checkpoint_timeout | 600
checkpoint_warning | 420
DateStyle | ISO, DMY
default_text_search_config | pg_catalog.english
effective_cache_size | 13107200
hot_standby | on
hot_standby_feedback | on
lc_messages | en_US.UTF-8
lc_monetary | en_US.UTF-8
lc_numeric | en_US.UTF-8
lc_time | en_US.UTF-8
listen_addresses | *
log_autovacuum_min_duration | 0
log_checkpoints | on
log_hostname | on
log_line_prefix | < %m >
log_lock_waits | on
log_rotation_size | 0
log_statement | ddl
log_timezone | Europe/Moscow
log_truncate_on_rotation | on
logging_collector | on
maintenance_work_mem | 2097152
max_connections | 4000
max_wal_senders | 10
pg_stat_statements.max | 10000
pg_stat_statements.track | all
shared_buffers | 524288
shared_preload_libraries | pg_stat_statements
TimeZone | Europe/Moscow
track_functions | all
track_io_timing | on
update_process_title | off
vacuum_defer_cleanup_age | 200000
wal_buffers | 8192
wal_keep_segments | 64
wal_level | hot_standby
wal_receiver_status_interval | 1
wal_sender_timeout | 3000
work_mem | 16384
(53 rows)

Time: 1.955 ms
rpopdb04g/postgres R #

On the replicas we issue SELECT queries that do index-only scans on this table.
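
(The exact queries vary, but they only touch the primary-key columns; an illustrative, made-up example of the shape that gets an index-only scan:)

-- illustrative only; the real queries are issued by the application with bound parameters
select uidl from rpop.rpop_uidl where popid = 1234567890;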

We could probably partition this table to reduce the size of the index so that the btree_xlog_vacuum function would work faster. But perhaps there is something else we can do right now?
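
(For completeness, by partitioning we mean the usual inheritance-based scheme available in 9.4; a rough sketch, where the child table names, the split on popid and the range boundaries are all made up:)

-- child tables inherit the columns of rpop.rpop_uidl; each holds a popid range
create table rpop.rpop_uidl_p0 ( check ( popid < 1000000000 ) )
    inherits (rpop.rpop_uidl);
alter table rpop.rpop_uidl_p0 add constraint pk_rpop_uidl_p0 primary key (popid, uidl);

create table rpop.rpop_uidl_p1 ( check ( popid >= 1000000000 and popid < 2000000000 ) )
    inherits (rpop.rpop_uidl);
alter table rpop.rpop_uidl_p1 add constraint pk_rpop_uidl_p1 primary key (popid, uidl);

-- ... and so on; inserts would be routed by the application or by a
-- BEFORE INSERT trigger on the parent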

Thanks in advance.

[0] http://www.postgresql.org/message-id/9358FF5E-0164-4508-A3DF-4A34A4C602A1@simply.name

--
May the force be with you…
https://simply.name
