BUG #15124: standby hung, recovering process stuck waiting, possible undetected deadlock

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: clinton(dot)adams(at)gmail(dot)com
Subject: BUG #15124: standby hung, recovering process stuck waiting, possible undetected deadlock
Date: 2018-03-21 18:42:23
Message-ID: 152165774326.9520.987533011190563481@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 15124
Logged by: Clinton Adams
Email address: clinton(dot)adams(at)gmail(dot)com
PostgreSQL version: 9.6.6
Operating system: CentOS Linux release 7.4.1708
Description:

One of our streaming standbys built up a wait queue behind the recovering
process, which rendered db inaccessible with all non-idle user processes
stuck in 'startup waiting'.

Some master log snippets:
2018-03-20 17:06:10.987 EDT,,,12954,,5a5ec246.329a,3015,,2018-01-16 22:25:58
EST,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2018-03-20 17:29:24.490 EDT,,,5197,,5ab17d34.144d,1,,2018-03-20 17:29:24
EDT,113/10510168,2787541257,LOG,00000,"automatic vacuum of table
""ateb.pg_catalog.pg_index"": index scans: 1
2018-03-20 17:30:19.876 EDT,,,12954,,5a5ec246.329a,3016,,2018-01-16 22:25:58
EST,,0,LOG,00000,"checkpoint complete: wrote 4045 buffers (0.4%); 0
transaction log file(s) added, 68 removed, 0 recycled; write=1448.033 s,
sync=0.318 s, total=1448.888 s; sync files=1513, longest=0.003 s,
average=0.000 s; distance=1800115 kB, estimate=23855600 kB",,,,,,,,,""

Standby has -
max_standby_streaming_delay = 4h
max_standby_archive_delay = 8h
hot_standby_feedback = off
wal_keep_segments = 2000
hot_standby = on

All instances have
synchronous_commit = off
commit_delay is not being used

Standby log showing first wait:
2018-03-20 17:29:34.588
EDT,"XXXXX","XXXXX",17933,"XX.XX.XX.XX:48164",5ab17d34.460d,1,"SELECT
waiting",2018-03-20 17:29:24 EDT,3/34326918,0,LOG,00000,"process 17933 still
waiting for AccessShareLock on relation 2610 of database 16403 after
10000.164 ms","Process holding the lock: 30493. Wait queue: 17933.",,,,,"

pid 30493 -
postgres 30493 1.0 17.4 9019072 8594764 ? Ss Jan16 942:24 postgres:
startup process recovering 0000000100009EFF00000035 waiting

relation 2610 is pg_index

Standby log showing wait queue:
2018-03-20 19:49:55.845
EDT,"XXXXX","XXXXX",9129,"XX.XX.XX.XX:57952",5ab18d0c.23a9,2,"startup
waiting",2018-03-20 18:37:00 EDT,177/101,0,LOG,00000,"process 9129 still
waiting for AccessShareLock on relation 2610 of database 16403 after
4375401.381 ms","Process holding the lock: 30493. Wait queue: 17933, 18459,
18515, 20148, 20196, 20446, 20460, 20474, 20475, 20476, 20477, 20478, 20479,
20890, 21076, 21104, 21234, 21493, 21505, 22736, 24478, 24744, 24757, 24894,
25140, 25152, 26389, 26396, 26397, 26398, 26400, 26399, 26401, 28159, 28433,
28455, 28580, 28850, 28862, 30119, 31929, 31931, 31943, 31956, 31968, 31969,
31970, 31971, 31972, 31974, 31988, 32213, 32260, 32272, 32411, 32617, 32678,
32691, 1590, 3466, 3480, 3492, 3777, 3789, 3929, 4192, 4204, 5515, 5524,
5523, 5525, 5526, 5527, 7377, 7390, 7403, 7687, 7700, 7824, 8102, 8115,
9368, 11277, 11293, 11309, 11666, 11705, 11718, 11843, 24149, 12132, 12144,
13473, 14743, 15002, 15393, 15405, 15418, 15711, 15723, 15860, 16134, 16146,
17468, 19406, 19424, 19440, 19732, 19745, 19883, 20157, 20169, 21513, 23420,
23434, 23447, 23753, 23794, 23806, 23944, 24218, 24230, 25568, 27492, 27506,
27519, 27825, 27851, 27973, 28260, 28272, 29625, 31564, 31578, 31591, 31877,
31890, 32025, 32305, 32317, 1227, 11021, 2268, 3091, 3121, 3135, 3422, 3461,
3475, 3607, 3883, 3895, 4719, 5058, 5461, 7073, 7086, 7099, 7384, 7399,
7523, 7803, 7815, 9129, 10997, 11015, 11029, 11316, 11330, 11458, 11737,
11749, 13057, 14927, 14940, 14959, 15290, 15302, 15437, 15706, 15719, 16275,
16653, 17029, 17044, 18521, 18869, 18887, 18900, 19165, 19178, 19303, 19569,
19582, 19950, 20874, 22702, 22714, 22727, 23003, 23015, 23146, 23408, 23421,
24686, 26545, 26599, 26612, 26938, 26951, 27075, 27355, 27367, 28248, 28703,
30479, 30495, 30511, 30791, 30792, 30929, 31175, 31187, 32482, 1852, 1873,
1886, 2151, 2164, 2289, 2555, 2567, 3727, 5612, 5626, 5640, 5961, 5973,
6110, 6356, 6368, 7647, 9407, 9425, 9438, 9704, 9716, 9840, 10099, 10111,
11388, 13169, 13183, 13195, 13460, 13476, 13612, 13857, 13869, 15161, 15413,
15444, 15758, 15774, 15816, 15817, 15819, 15820, 16785, 16975, 17002, 17015,
17336, 17349, 17474, 17746, 17758, 19060, 20888, 20903, 20916, 21193, 21208,
21331, 21602, 21614, 22905, 22980, 23255, 24320, 24958, 24977, 24989, 25294,
25306, 25437, 25729, 25741, 27109, 29150, 29169, 29186, 29499, 29539, 29551,
29676, 29930, 29973, 29985, 31364.",,,,,,,,"psql"

I confirmed that there were no idle in transactions on the standby. I could
not get access to db to query pg_locks view. Before restarting the standby
as last resort, there were 280 user processes in 'startup waiting', 7
'idle', and 3 in 'SELECT waiting'.

After restart of standby operations returned to normal, retrieving WAL
segments from archive and then going back to streaming. Other streaming
standbys did not show the problem.

An update to 9.6.8 is planned, but I do not see anything applicable to this
in the release notes. This is possibly related to "BUG #13657: Some kind of
undetected deadlock between query and "startup process" on replica.",
https://www.postgresql.org/message-id/20151001121745.368.58359%40wrigleys.postgresql.org

Browse pgsql-bugs by date

  From Date Subject
Next Message Keith Fiske 2018-03-21 20:30:00 Re: BUG #15123: pgAdmin 4 no error message
Previous Message Teodor Sigaev 2018-03-21 12:35:56 Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug