Re: pgsql: Improve runtime and output of tests for replication slots checkp

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Alexander Korotkov <akorotkov(at)postgresql(dot)org>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date: 2025-06-20 23:28:32
Message-ID: CAPpHfdtmLNpswWzk2qQ7-nMBLod2GrRWwMvfb0hc7tQfDD--ZQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-committers

On Sat, Jun 21, 2025 at 1:40 AM Alexander Korotkov <aekorotkov(at)gmail(dot)com>
wrote:
> On Sat, Jun 21, 2025 at 1:25 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > I wrote:
> > > But in the buildfarm failures I don't see any 'checkpoint complete'
> > > before the shutdown.
> >
> > Ooops, I lied: we have at least one case where the checkpoint does
> > finish but then it hangs up anyway:
> >
> >
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=melonworm&dt=2025-06-20%2019%3A59%3A31&stg=recovery-check
> >
> > 2025-06-20 20:05:32.548 UTC [3171474][client backend][61/2:0] LOG:
statement: select
injection_points_wakeup('checkpoint-before-old-wal-removal')
> > 2025-06-20 20:05:32.550 UTC [3162113][checkpointer][:0] LOG:
checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL
file(s) added, 0 removed, 0 recycled; write=0.045 s, sync=0.001 s,
total=1.507 s; sync files=1, longest=0.001 s, average=0.001 s;
distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
> > 2025-06-20 20:05:32.551 UTC [3171474][client backend][:0] LOG:
disconnection: session time: 0:00:00.073 user=bf database=postgres
host=[local]
> > 2025-06-20 20:05:32.552 UTC [3170563][client backend][:0] LOG:
disconnection: session time: 0:00:01.538 user=bf database=postgres
host=[local]
> > 2025-06-20 20:05:32.564 UTC [3163446][client backend][7/146:0] LOG:
statement: SELECT 1
> > 2025-06-20 20:05:32.564 UTC [3161873][postmaster][:0] LOG: received
immediate shutdown request
> > 2025-06-20 20:05:32.604 UTC [3161873][postmaster][:0] LOG: database
system is shut down
> >
> > I'm still confused about whether the 046 script intends to
> > sometimes test cases where the checkpoint doesn't have time to
> > complete. But it seems that whatever is going on here
> > is a bit subtle and platform-dependent.
>
> I think this indicates unfinished intention to wait for checkpoint
> completion. But I think both cases (checkpoint finished and
> unfinished) should work correctly. So, I believe there is a backend
> problem. I'm trying to reproduce this locally. Sorry for the
> confusion.

I manage to reproduce the bug locally. I've got the following backtrace.

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x000000019e6fd3c8 libsystem_kernel.dylib`__semwait_signal + 8
frame #1: 0x000000019e5de568 libsystem_c.dylib`nanosleep + 220
frame #2: 0x0000000103009628 postgres`pg_usleep(microsec=1000) at
pgsleep.c:50:10
frame #3: 0x00000001027bd248
postgres`read_local_xlog_page_guts(state=0x000000011383ed58,
targetPagePtr=687882240, reqLen=277, targetRecPtr=687874360,
cur_page="\U00000018\xd1\U00000005", wait_for_wal=true) at xlogutils.c:950:4
frame #4: 0x00000001027bd130
postgres`read_local_xlog_page(state=0x000000011383ed58,
targetPagePtr=687882240, reqLen=277, targetRecPtr=687874360,
cur_page="\U00000018\xd1\U00000005") at xlogutils.c:848:9
frame #5: 0x00000001027ae184
postgres`ReadPageInternal(state=0x000000011383ed58, pageptr=687882240,
reqLen=277) at xlogreader.c:1061:12
frame #6: 0x00000001027acd14
postgres`XLogDecodeNextRecord(state=0x000000011383ed58, nonblocking=false)
at xlogreader.c:727:14
frame #7: 0x00000001027ac6bc
postgres`XLogReadAhead(state=0x000000011383ed58, nonblocking=false) at
xlogreader.c:974:11
frame #8: 0x00000001027ac5c0
postgres`XLogReadRecord(state=0x000000011383ed58,
errormsg=0x000000016d798dd0) at xlogreader.c:405:3
frame #9: 0x0000000102c1d1d0
postgres`pg_logical_slot_get_changes_guts(fcinfo=0x0000000113830700,
confirm=true, binary=false) at logicalfuncs.c:257:13
frame #10: 0x0000000102c1ca00
postgres`pg_logical_slot_get_changes(fcinfo=0x0000000113830700) at
logicalfuncs.c:333:9
frame #11: 0x00000001029e1adc
postgres`ExecMakeTableFunctionResult(setexpr=0x0000000113825440,
econtext=0x00000001138252e0, argContext=0x0000000113830600,
expectedDesc=0x000000011382ecb0, randomAccess=false) at execSRF.c:234:13

While looking at frame 5 I found the following code to be executed.

targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
targetRecOff = RecPtr % XLOG_BLCKSZ;

/*
* Read the page containing the record into state->readBuf. Request enough
* byte to cover the whole record header, or at least the part of it that
* fits on the same page.
*/
readOff = ReadPageInternal(state, targetPagePtr,
Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));

And I see the following variable values.

(lldb) p/x targetPagePtr
(XLogRecPtr) 0x0000000029004000
(lldb) p/x RecPtr
(XLogRecPtr) 0x0000000029002138

I hardly understand how is this possible given it was compiled with "-O0".
I'm planning to continue investigating this tomorrow.

------
Regards,
Alexander Korotkov
Supabase

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Tom Lane 2025-06-20 23:30:58 Re: pgsql: Improve runtime and output of tests for replication slots checkp
Previous Message Alexander Korotkov 2025-06-20 22:40:46 Re: pgsql: Improve runtime and output of tests for replication slots checkp