From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Melanie Plageman <melanieplageman(at)gmail(dot)com> |
Cc: | Alexander Korotkov <akorotkov(at)postgresql(dot)org>, pgsql-committers(at)lists(dot)postgresql(dot)org |
Subject: | Re: pgsql: Improve runtime and output of tests for replication slots checkp |
Date: | 2025-06-20 22:03:05 |
Message-ID: | 2542023.1750456985@sss.pgh.pa.us |
Lists: | pgsql-committers |
Melanie Plageman <melanieplageman(at)gmail(dot)com> writes:
> Quite a few animals have started failing since this commit (for example
> [1]). I haven't looked into why, but I suspect something is wrong.
It looks to me like it's being triggered by this questionable bit in
046_checkpoint_logical_slot.pl:
    # Continue the checkpoint.
    $node->safe_psql('postgres',
        q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
    # Abruptly stop the server (1 second should be enough for the checkpoint
    # to finish; it would be better).
    $node->stop('immediate');
That second comment is pretty unintelligible, but I think it's
expecting that we'd give the checkpoint 1 second to complete,
which the code is *not* doing. On my own machine it looks like the
checkpoint does manage to complete within about 1ms, just barely
before the shutdown arrives:
    2025-06-20 17:52:25.599 EDT [2538690] 046_checkpoint_logical_slot.pl LOG: statement: select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())
    2025-06-20 17:52:25.602 EDT [2538692] 046_checkpoint_logical_slot.pl LOG: statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')
    2025-06-20 17:52:25.603 EDT [2538557] LOG: checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.003 s, sync=0.001 s, total=1.074 s; sync files=0, longest=0.000 s, average=0.000 s; distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
    2025-06-20 17:52:25.604 EDT [2538553] LOG: received immediate shutdown request
But in the buildfarm failures I don't see any 'checkpoint complete'
before the shutdown.
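For anyone wanting to check a server log for the same ordering, here is a minimal sketch in Python. It is not part of the PostgreSQL test suite: the function name and the two sample lists are hypothetical. The first list is abbreviated from the log excerpt above, and the second is the same excerpt with the 'checkpoint complete' line removed to mimic what the buildfarm logs appear to show.

```python
def checkpoint_completed_before_shutdown(log_lines):
    """Return True if a 'checkpoint complete' message is logged before the
    first immediate-shutdown request. Lines after that request are ignored."""
    saw_checkpoint = False
    for line in log_lines:
        if "checkpoint complete:" in line:
            saw_checkpoint = True
        elif "received immediate shutdown request" in line:
            break
    return saw_checkpoint


# Abbreviated from the log excerpt quoted in this message.
LOCAL_RUN = [
    "2025-06-20 17:52:25.602 EDT [2538692] 046_checkpoint_logical_slot.pl LOG: "
    "statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')",
    "2025-06-20 17:52:25.603 EDT [2538557] LOG: checkpoint complete: wrote 1 buffers (0.0%)",
    "2025-06-20 17:52:25.604 EDT [2538553] LOG: received immediate shutdown request",
]

# Hypothetical: the same excerpt without the 'checkpoint complete' line.
NO_CHECKPOINT_RUN = [LOCAL_RUN[0], LOCAL_RUN[2]]

if __name__ == "__main__":
    print(checkpoint_completed_before_shutdown(LOCAL_RUN))
    print(checkpoint_completed_before_shutdown(NO_CHECKPOINT_RUN))
```

The check only looks at message order, so it says nothing about why the checkpoint did or did not finish in time.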
If this is an accurate diagnosis then it indicates both a test bug
(it should delay here, or else the comment needs to be fixed to explain
what we're actually testing) and a backend bug, because an immediate
stop a/k/a crash before completing the checkpoint should not lead to
failure to function after the next restart.
regards, tom lane