Re: pgsql: Improve runtime and output of tests for replication slots checkp

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Alexander Korotkov <akorotkov(at)postgresql(dot)org>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-20 22:03:05
Message-ID: 2542023.1750456985@sss.pgh.pa.us
Lists: pgsql-committers

Melanie Plageman <melanieplageman(at)gmail(dot)com> writes:
> Quite a few animals have started failing since this commit (for example
> [1]). I haven't looked into why, but I suspect something is wrong.

It looks to me like it's being triggered by this questionable bit in
046_checkpoint_logical_slot.pl:

# Continue the checkpoint.
$node->safe_psql('postgres',
	q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});

# Abruptly stop the server (1 second should be enough for the checkpoint
# to finish; it would be better).
$node->stop('immediate');

That second comment is pretty unintelligible, but I think it's
expecting that we'd give the checkpoint 1 second to complete,
which the code is *not* doing. On my own machine it looks like the
checkpoint does manage to complete within about 1ms, just barely
before the shutdown arrives:

2025-06-20 17:52:25.599 EDT [2538690] 046_checkpoint_logical_slot.pl LOG: statement: select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())
2025-06-20 17:52:25.602 EDT [2538692] 046_checkpoint_logical_slot.pl LOG: statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')
2025-06-20 17:52:25.603 EDT [2538557] LOG: checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.003 s, sync=0.001 s, total=1.074 s; sync files=0, longest=0.000 s, average=0.000 s; distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
2025-06-20 17:52:25.604 EDT [2538553] LOG: received immediate shutdown request

But in the buildfarm failures I don't see any 'checkpoint complete'
before the shutdown.
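
The check described here (does 'checkpoint complete' show up between the
wakeup and the shutdown request?) can be sketched as a small log scan.
This is only an illustration in Python, not part of the test suite: the
function name is hypothetical, and the match strings are lifted from the
log excerpt above on the assumption that the buildfarm logs use the same
message wording.

```python
import sys

# Substrings taken from the log excerpt in this message (assumed stable).
WAKEUP = "injection_points_wakeup('checkpoint-before-old-wal-removal')"
COMPLETE = "checkpoint complete:"
SHUTDOWN = "received immediate shutdown request"


def checkpoint_finished_before_stop(log_lines):
    """Return True if a 'checkpoint complete' line appears between the last
    wakeup statement and the first immediate shutdown request."""
    lines = list(log_lines)

    # Locate the first immediate shutdown request.
    shutdown_idx = next(
        (i for i, line in enumerate(lines) if SHUTDOWN in line), None
    )
    if shutdown_idx is None:
        raise ValueError("no immediate shutdown request found in log")

    # Walk backwards to the most recent wakeup statement before it.
    wakeup_idx = None
    for i in range(shutdown_idx - 1, -1, -1):
        if WAKEUP in lines[i]:
            wakeup_idx = i
            break
    if wakeup_idx is None:
        raise ValueError("no injection point wakeup found before shutdown")

    # The checkpoint finished in time only if its completion was logged
    # strictly between those two lines.
    return any(COMPLETE in line for line in lines[wakeup_idx + 1:shutdown_idx])


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_log.py path/to/postmaster.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        print(checkpoint_finished_before_stop(fh))
```

Run against the four log lines above, this returns True; for a log in
which the shutdown request follows the wakeup directly, it returns False.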

If this is an accurate diagnosis, then it indicates both a test bug
(the test should delay here, or else the comment needs to be fixed to
explain what we're actually testing) and a backend bug, because an
immediate stop, a/k/a a crash, before the checkpoint completes should
not lead to failure to function after the next restart.

regards, tom lane
