From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Melanie Plageman <melanieplageman(at)gmail(dot)com>, Alexander Korotkov <akorotkov(at)postgresql(dot)org>, pgsql-committers(at)lists(dot)postgresql(dot)org |
Subject: | Re: pgsql: Improve runtime and output of tests for replication slots checkp |
Date: | 2025-06-20 23:28:32 |
Message-ID: | CAPpHfdtmLNpswWzk2qQ7-nMBLod2GrRWwMvfb0hc7tQfDD--ZQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-committers |
On Sat, Jun 21, 2025 at 1:40 AM Alexander Korotkov <aekorotkov(at)gmail(dot)com>
wrote:
> On Sat, Jun 21, 2025 at 1:25 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > I wrote:
> > > But in the buildfarm failures I don't see any 'checkpoint complete'
> > > before the shutdown.
> >
> > Ooops, I lied: we have at least one case where the checkpoint does
> > finish but then it hangs up anyway:
> >
> >
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=melonworm&dt=2025-06-20%2019%3A59%3A31&stg=recovery-check
> >
> > 2025-06-20 20:05:32.548 UTC [3171474][client backend][61/2:0] LOG:
statement: select
injection_points_wakeup('checkpoint-before-old-wal-removal')
> > 2025-06-20 20:05:32.550 UTC [3162113][checkpointer][:0] LOG:
checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL
file(s) added, 0 removed, 0 recycled; write=0.045 s, sync=0.001 s,
total=1.507 s; sync files=1, longest=0.001 s, average=0.001 s;
distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
> > 2025-06-20 20:05:32.551 UTC [3171474][client backend][:0] LOG:
disconnection: session time: 0:00:00.073 user=bf database=postgres
host=[local]
> > 2025-06-20 20:05:32.552 UTC [3170563][client backend][:0] LOG:
disconnection: session time: 0:00:01.538 user=bf database=postgres
host=[local]
> > 2025-06-20 20:05:32.564 UTC [3163446][client backend][7/146:0] LOG:
statement: SELECT 1
> > 2025-06-20 20:05:32.564 UTC [3161873][postmaster][:0] LOG: received
immediate shutdown request
> > 2025-06-20 20:05:32.604 UTC [3161873][postmaster][:0] LOG: database
system is shut down
> >
> > I'm still confused about whether the 046 script intends to
> > sometimes test cases where the checkpoint doesn't have time to
> > complete. But it seems that whatever is going on here
> > is a bit subtle and platform-dependent.
>
> I think this indicates unfinished intention to wait for checkpoint
> completion. But I think both cases (checkpoint finished and
> unfinished) should work correctly. So, I believe there is a backend
> problem. I'm trying to reproduce this locally. Sorry for the
> confusion.
I manage to reproduce the bug locally. I've got the following backtrace.
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x000000019e6fd3c8 libsystem_kernel.dylib`__semwait_signal + 8
frame #1: 0x000000019e5de568 libsystem_c.dylib`nanosleep + 220
frame #2: 0x0000000103009628 postgres`pg_usleep(microsec=1000) at
pgsleep.c:50:10
frame #3: 0x00000001027bd248
postgres`read_local_xlog_page_guts(state=0x000000011383ed58,
targetPagePtr=687882240, reqLen=277, targetRecPtr=687874360,
cur_page="\U00000018\xd1\U00000005", wait_for_wal=true) at xlogutils.c:950:4
frame #4: 0x00000001027bd130
postgres`read_local_xlog_page(state=0x000000011383ed58,
targetPagePtr=687882240, reqLen=277, targetRecPtr=687874360,
cur_page="\U00000018\xd1\U00000005") at xlogutils.c:848:9
frame #5: 0x00000001027ae184
postgres`ReadPageInternal(state=0x000000011383ed58, pageptr=687882240,
reqLen=277) at xlogreader.c:1061:12
frame #6: 0x00000001027acd14
postgres`XLogDecodeNextRecord(state=0x000000011383ed58, nonblocking=false)
at xlogreader.c:727:14
frame #7: 0x00000001027ac6bc
postgres`XLogReadAhead(state=0x000000011383ed58, nonblocking=false) at
xlogreader.c:974:11
frame #8: 0x00000001027ac5c0
postgres`XLogReadRecord(state=0x000000011383ed58,
errormsg=0x000000016d798dd0) at xlogreader.c:405:3
frame #9: 0x0000000102c1d1d0
postgres`pg_logical_slot_get_changes_guts(fcinfo=0x0000000113830700,
confirm=true, binary=false) at logicalfuncs.c:257:13
frame #10: 0x0000000102c1ca00
postgres`pg_logical_slot_get_changes(fcinfo=0x0000000113830700) at
logicalfuncs.c:333:9
frame #11: 0x00000001029e1adc
postgres`ExecMakeTableFunctionResult(setexpr=0x0000000113825440,
econtext=0x00000001138252e0, argContext=0x0000000113830600,
expectedDesc=0x000000011382ecb0, randomAccess=false) at execSRF.c:234:13
While looking at frame 5 I found the following code to be executed.
targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
targetRecOff = RecPtr % XLOG_BLCKSZ;
/*
* Read the page containing the record into state->readBuf. Request enough
* byte to cover the whole record header, or at least the part of it that
* fits on the same page.
*/
readOff = ReadPageInternal(state, targetPagePtr,
Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
And I see the following variable values.
(lldb) p/x targetPagePtr
(XLogRecPtr) 0x0000000029004000
(lldb) p/x RecPtr
(XLogRecPtr) 0x0000000029002138
I hardly understand how is this possible given it was compiled with "-O0".
I'm planning to continue investigating this tomorrow.
------
Regards,
Alexander Korotkov
Supabase
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2025-06-20 23:30:58 | Re: pgsql: Improve runtime and output of tests for replication slots checkp |
Previous Message | Alexander Korotkov | 2025-06-20 22:40:46 | Re: pgsql: Improve runtime and output of tests for replication slots checkp |