From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: failure in 019_replslot_limit
I just saw the following failure:
after a commit of mine. The symptoms look unrelated to that commit, though.
[17:54:42.188](258.345s) # poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
# expecting this output:
# last actual query output:
# with stderr:
timed out waiting for slot to be lost at /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/recovery/t/019_replslot_limit.pl line 400.
We're expecting "lost" but are getting "unreserved".
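To make the failure mode concrete, here is a minimal Python re-implementation of what the TAP helper's polling loop does (the real `poll_query_until` lives in the Perl test infrastructure; the function name and defaults below are assumptions for illustration only):

```python
import time

def poll_query_until(run_query, expected, timeout=180.0, interval=0.1):
    """Sketch of the TAP helper's polling loop: re-run the query until it
    returns the expected output or the timeout expires.  Hypothetical
    Python rendering, not the actual Perl test code."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if run_query() == expected:
            return True
        time.sleep(interval)
    return False

# Simulate the observed failure: wal_status stays "unreserved" and never
# transitions to "lost", so the poll times out.
def stuck_query():
    return "unreserved"

print(poll_query_until(stuck_query, "lost", timeout=0.3, interval=0.05))  # False
```

The poll only succeeds once a checkpoint has actually removed the slot's WAL segments, which is what moves `wal_status` from `unreserved` to `lost`.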
At first I thought this was just a race: it's not guaranteed that a checkpoint
that removes the WAL files occurs anytime soon.
But there might be something else going on - in this case a checkpoint
started, but never finished:
2023-04-05 17:50:23.786 UTC  019_replslot_limit.pl LOG: statement: SELECT pg_switch_wal();
2023-04-05 17:50:23.787 UTC  LOG: checkpoints are occurring too frequently (2 seconds apart)
2023-04-05 17:50:23.787 UTC  HINT: Consider increasing the configuration parameter "max_wal_size".
2023-04-05 17:50:23.787 UTC  LOG: checkpoint starting: wal
2023-04-05 17:50:23.837 UTC  019_replslot_limit.pl LOG: statement: CREATE TABLE t ();
2023-04-05 17:50:23.839 UTC  019_replslot_limit.pl LOG: statement: DROP TABLE t;
2023-04-05 17:50:23.840 UTC  019_replslot_limit.pl LOG: statement: SELECT pg_switch_wal();
2023-04-05 17:50:23.841 UTC  LOG: terminating process 344783 to release replication slot "rep3"
2023-04-05 17:50:23.841 UTC  DETAIL: The slot's restart_lsn 0/7000D8 exceeds the limit by 1048360 bytes.
2023-04-05 17:50:23.841 UTC  HINT: You might need to increase max_slot_wal_keep_size.
2023-04-05 17:50:23.862 UTC  standby_3 FATAL: terminating connection due to administrator command
2023-04-05 17:50:23.862 UTC  standby_3 STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2023-04-05 17:50:23.893 UTC  019_replslot_limit.pl LOG: statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
[many repetitions of the above, just differing in time and pid]
2023-04-05 17:54:42.084 UTC  019_replslot_limit.pl LOG: statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
2023-04-05 17:54:42.200 UTC  LOG: received immediate shutdown request
2023-04-05 17:54:42.229 UTC  LOG: database system is shut down
Note that a checkpoint started at "17:50:23.787" but didn't finish before the
database was shut down. As far as I can tell, this cannot be caused by
checkpoint_timeout, because by the time we get to invalidating replication
slots, we have already done CheckPointBuffers(), and that's the only thing that
delays based on checkpoint_timeout.
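The suspected stuck state can be modeled with a toy condition-variable wait. This is purely an illustrative sketch, with hypothetical names: in PostgreSQL the actual invalidation happens in InvalidateObsoleteReplicationSlots() in slot.c, where the checkpointer signals the process holding the slot and then sleeps until the slot is released. If the signalled walsender never releases the slot, the checkpointer never wakes up and the checkpoint never finishes:

```python
import threading

class Slot:
    """Toy stand-in for a replication slot; active_pid is the backend
    currently holding it (hypothetical structure for illustration)."""
    def __init__(self, active_pid):
        self.active_pid = active_pid
        self.cond = threading.Condition()

    def release(self):
        with self.cond:
            self.active_pid = None
            self.cond.notify_all()

def invalidate_slot(slot, timeout=None):
    """After (conceptually) signalling the owning process, wait for the
    slot to be released.  With timeout=None this blocks forever, which
    models the hypothesized stuck checkpointer."""
    with slot.cond:
        return slot.cond.wait_for(lambda: slot.active_pid is None, timeout)

# Cooperative walsender: releases the slot shortly after being signalled.
slot = Slot(active_pid=344783)
threading.Timer(0.05, slot.release).start()
print(invalidate_slot(slot, timeout=2.0))   # True: the checkpoint can proceed

# Stuck walsender: never releases the slot.  Here we time out so the demo
# terminates; an untimed wait would hang, matching the log above.
print(invalidate_slot(Slot(active_pid=344784), timeout=0.2))  # False
```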
ISTM that this indicates that the checkpointer got stuck after signalling
Do you see any other explanation?