From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Alexander Lakhin <exclusion(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Non-reproducible AIO failure |
Date: | 2025-05-25 23:44:48 |
Message-ID: | CA+hUKGK2woMXTbG9xsuQ-d3o8N8du40F6tH9sAiKCY3eTN_VXQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Sun, May 25, 2025 at 3:22 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> > Can you get a core and print *ioh in the debugger?
>
> So far, I've failed to get anything useful out of core files
> from this failure. The trace goes back no further than
>
> (lldb) bt
> * thread #1
> * frame #0: 0x000000018de39388 libsystem_kernel.dylib`__pthread_kill + 8
>
> That's quite odd in itself: while I don't find the debugging
> environment on macOS to be the greatest, it's not normally
> this unhelpful.
(And Alexander reported the same off-list.) It's interesting that the
elog.c backtrace stuff is able to analyse the stack and it looks
normal AFAICS. Could that be interfering with the stack in the core?!
I doubt it ... I kinda wonder if the debugger might be confused about
libsystem sometimes since it has ceased to be a regular Mach-O file on
disk, but IDK; maybe gdb (from MacPorts etc) would offer a clue?
So far we have:
TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 20931
0 postgres 0x0000000105299c84 ExceptionalCondition + 108
1 postgres 0x00000001051159ac WaitReadBuffers + 616
2 postgres 0x00000001053611ec read_stream_next_buffer.cold.1 + 184
3 postgres 0x0000000105111630 read_stream_next_buffer + 300
4 postgres 0x0000000104e0b994 heap_fetch_next_buffer + 136
5 postgres 0x0000000104e018f4 heapgettup_pagemode + 204
Hmm, looking around that code and wracking my brain for things that
might happen on one OS but not others, I wonder about partial I/Os.
Perhaps combined with some overlapping requests.
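To make "partial I/O" concrete: a read may legitimately return fewer bytes than requested, and the caller has to notice and reissue the remainder. The following is only a minimal sketch of that pattern using plain POSIX pread(); read_fully is a hypothetical helper invented for illustration, not a PostgreSQL function, and it says nothing about how the AIO code actually handles short reads.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): read up to len bytes starting
 * at offset, reissuing the request after short reads and EINTR.
 * Returns the number of bytes read (less than len only at EOF), or -1 on
 * error with errno set by pread().
 */
static ssize_t
read_fully(int fd, void *buf, size_t len, off_t offset)
{
    size_t      done = 0;

    while (done < len)
    {
        ssize_t     n = pread(fd, (char *) buf + done, len - done,
                              offset + (off_t) done);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before any data: retry */
            return -1;          /* real error */
        }
        if (n == 0)
            break;              /* EOF: nothing more to read */
        done += (size_t) n;     /* short read: loop to fetch the rest */
    }
    return (ssize_t) done;
}
```

A caller that assumed one request always completes in one step would skip the loop, which is the kind of platform-dependent behaviour worth ruling out here.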
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355
0 postgres 0x0000000104f078f4 ExceptionalCondition + 236
1 postgres 0x0000000104c0ebd4 pgaio_io_before_start + 260
2 postgres 0x0000000104c0ea94 pgaio_io_start_readv + 36
3 postgres 0x0000000104c2d4e8 FileStartReadV + 252
4 postgres 0x0000000104c807c8 mdstartreadv + 668
5 postgres 0x0000000104c83db0 smgrstartreadv + 116
But this one seems like a more basic confusion... wild writes
somewhere? Hmm, we need to see what's in that struct.
If we can't get a debugger to break there or a core file to be
analysable, maybe we should try logging as much info as possible at
those points to learn a bit more? I would be digging like that myself
but I haven't seen this failure on my little M4 MacBook Air yet
(Sequoia 15.5, Apple clang-1700.0.13.3). It is infected with
corporate security-ware that intercepts at least file system stuff and
slows it down, and I can't even convince it to dump core files right
now. Could you guys please share your exact repro steps?
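
As a rough illustration of the "log as much as possible" idea, one option is to format the suspect struct into a string just before the assertion fires, so the server log carries the state even when the core file is unusable. This is only a sketch: DemoHandle and its fields are stand-ins invented here and do not reflect the layout of the real handle struct, and describe_handle is a hypothetical helper.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for the handle under suspicion. The field names
 * are illustrative assumptions, not the real struct's members.
 */
typedef struct DemoHandle
{
    int         id;
    int         state;
    int         op;
    int         result;
} DemoHandle;

/*
 * Format every field into out. Returns what snprintf() returns: the number
 * of characters that would have been written, excluding the terminator.
 */
static int
describe_handle(const DemoHandle *h, char *out, size_t outlen)
{
    return snprintf(out, outlen, "id=%d state=%d op=%d result=%d",
                    h->id, h->state, h->op, h->result);
}
```

The resulting string could be emitted with whatever logging call is convenient immediately before the failing Assert, so that each failure leaves a record of the fields it was checking.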