From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Nico Williams <nico(at)cryptonector(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, rmt(at)lists(dot)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Subject: | Re: Non-reproducible AIO failure |
Date: | 2025-08-20 18:00:00 |
Message-ID: | c4794f6c-833a-4e2d-a027-2bb48810e5e1@gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello Andres,
22.07.2025 02:19, Andres Freund wrote:
> Hi,
>
> On 2025-06-19 10:16:12 -0500, Nico Williams wrote:
>> On Thu, Jun 19, 2025 at 05:05:25PM +0200, Daniel Gustafsson wrote:
>>> I also dug out an archeologically old MacBook Pro running macOS High Sierra
>>> 10.13.6 with an i5 using Apple LLVM version 10.0.0 (clang-1000.10.44.4), and it
>>> too fails to reproduce any issue.
>> It's not going to be reproducible on x86_64 macs. It has to be ARM
>> macs.
> In addition to various other folks that had reported in, I had various macs,
> both real and virtualized, run this test in a loop. For weeks - without
> reproducing the issue once [1]. Since this can only be reproduced on
> Alexander and Konstantin's laptops, running an older macos [2] and running
> MDM, I'm for now considering this issue to not be reproducible outside of that
> environment. That's deeply unsatisfying, but I just don't see what else I
> could do.
I got access to that M1 laptop the past week and reproduced the failure
again. Then I just copied all the postgres directory to my MacBook M4 Pro
and ran the same tests there. And the issue reproduced! This fact made me
believe that that's not the old CPU-specific issue. Armed with this belief,
I managed to reproduce this with a build made on M4 too.
So for now I have pretty stable reproducer which I'd like to share (please
forgive me some rough edges):
git apply AIO-debugging.patch
CC=/opt/homebrew/Cellar/llvm(at)17/17.0.6/bin/clang-17 CFLAGS="" ./configure --enable-cassert --enable-debug
--enable-tap-tests --without-icu --without-zlib --prefix=/tmp/postgres -q && make -s -j8 && make -s install && make -s check
for i in {1..10}; do np=$((20 + $RANDOM % 10)); echo "iteration $i: $np"; time parallel -j40 --linebuffer --tag
/tmp/repro-AIO-Assert.sh {} ::: `seq $np` || break; sleep $(($RANDOM % 20)); done; echo -e "\007"
(For M1 it works with ~6 servers.)
The script executes opr_sanity.sql from the regress suite, but it can be
reduced, of course. Though reducing the number of queries seemingly leads
to decrease of the failure's probability, so I decided to leave it as-is
for now. Anyway, I can work on simplifying the repro, if needed.
This fails for me as below:
Run 1:
iteration 3: 29
...
24 TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 167, PID: 56420
...
real 12m3.018s
server24 log contains:
!!!AsyncReadBuffers [56420] (1)| blocknum: 73, ioh: 0x102588640, ioh->op: 0, ioh->state: 1, ioh->target: 0, ioh->result:
0, ioh->num_callbacks: 0, ioh->generation: 16293
!!!AsyncReadBuffers [56420] before smgrstartreadv| blocknum: 73, ioh: 0x102588640, ioh->op: 170, pre_op: 170,
ioh->state: 1, ioh->target: 0, ioh->result: 0, ioh->num_callbacks: 1, ioh->generation: 16293
2025-08-20 19:20:23.121 EEST [56420] LOG: !!!FileStartReadV| ioh: 0x102588640, ioh->op: 170, ioh->state: 1,
ioh->target: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 16293
2025-08-20 19:20:23.121 EEST [56420] STATEMENT: SELECT DISTINCT p1.proargtypes[7]::regtype, p2.proargtypes[7]::regtype
FROM pg_proc AS p1, pg_proc AS p2
WHERE p1.oid != p2.oid AND
p1.prosrc = p2.prosrc AND
p1.prolang = 12 AND p2.prolang = 12 AND
p1.prokind != 'a' AND p2.prokind != 'a' AND
(p1.proargtypes[7] < p2.proargtypes[7])
ORDER BY 1, 2;
2025-08-20 19:20:23.121 EEST [56420] LOG: !!!pgaio_io_before_start| ioh: 0x102588640, ioh->op: 170, ioh->state: 1,
ioh->target: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 16293
2025-08-20 19:20:23.121 EEST [56420] STATEMENT: SELECT DISTINCT p1.proargtypes[7]::regtype, p2.proargtypes[7]::regtype
FROM pg_proc AS p1, pg_proc AS p2
WHERE p1.oid != p2.oid AND
p1.prosrc = p2.prosrc AND
p1.prolang = 12 AND p2.prolang = 12 AND
p1.prokind != 'a' AND p2.prokind != 'a' AND
(p1.proargtypes[7] < p2.proargtypes[7])
ORDER BY 1, 2;
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 167, PID: 56420
Run 2:
iteration 3: 21
...
21 TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 167, PID: 378
...
real 8m32.370s
server21.log contains:
!!!pgaio_io_reclaim [378]| ioh: 0x103f533d0, ioh->op: 0, ioh->state: 0, ioh->target: 0, ioh->generation: 13609
!!!pgaio_io_acquire_nb [378]| ioh: 0x103f533d0, ioh->op: 170, pre_op: 170
!!!AsyncReadBuffers [378] (1)| blocknum: 55, ioh: 0x103f533d0, ioh->op: 170, ioh->state: 1, ioh->target: 0, ioh->result:
0, ioh->num_callbacks: 0, ioh->generation: 13609
!!!AsyncReadBuffers [378] before smgrstartreadv| blocknum: 55, ioh: 0x103f533d0, ioh->op: 170, pre_op: 170, ioh->state:
1, ioh->target: 0, ioh->result: 0, ioh->num_callbacks: 1, ioh->generation: 13609
!!!pgaio_io_reclaim [379]| ioh: 0x103f53340, ioh->op: 0, ioh->state: 1, ioh->target: 0, ioh->result: 0,
ioh->num_callbacks: 0, ioh->generation: 13457
!!!pgaio_io_reclaim [379]| ioh: 0x103f53340, ioh->op: 0, ioh->state: 0, ioh->target: 0, ioh->generation: 13458
2025-08-20 19:53:56.726 EEST [378] LOG: !!!FileStartReadV| ioh: 0x103f533d0, ioh->op: 170, ioh->state: 1, ioh->target:
1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 13609
2025-08-20 19:53:56.726 EEST [378] STATEMENT: SELECT p1.oid, p1.proname
FROM pg_proc as p1
WHERE p1.prolang = 0 OR p1.prorettype = 0 OR
p1.pronargs < 0 OR
p1.pronargdefaults < 0 OR
p1.pronargdefaults > p1.pronargs OR
array_lower(p1.proargtypes, 1) != 0 OR
array_upper(p1.proargtypes, 1) != p1.pronargs-1 OR
0::oid = ANY (p1.proargtypes) OR
procost <= 0 OR
CASE WHEN proretset THEN prorows <= 0 ELSE prorows != 0 END OR
prokind NOT IN ('f', 'a', 'w', 'p') OR
provolatile NOT IN ('i', 's', 'v') OR
proparallel NOT IN ('s', 'r', 'u');
2025-08-20 19:53:56.726 EEST [378] LOG: !!!pgaio_io_before_start| ioh: 0x103f533d0, ioh->op: 170, ioh->state: 1,
ioh->target: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 13609
2025-08-20 19:53:56.726 EEST [378] STATEMENT: SELECT p1.oid, p1.proname
FROM pg_proc as p1
WHERE p1.prolang = 0 OR p1.prorettype = 0 OR
p1.pronargs < 0 OR
p1.pronargdefaults < 0 OR
p1.pronargdefaults > p1.pronargs OR
array_lower(p1.proargtypes, 1) != 0 OR
array_upper(p1.proargtypes, 1) != p1.pronargs-1 OR
0::oid = ANY (p1.proargtypes) OR
procost <= 0 OR
CASE WHEN proretset THEN prorows <= 0 ELSE prorows != 0 END OR
prokind NOT IN ('f', 'a', 'w', 'p') OR
provolatile NOT IN ('i', 's', 'v') OR
proparallel NOT IN ('s', 'r', 'u');
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 167, PID: 378
Run 3:
iteration 3: 29
...
2 TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 167, PID: 39023
...
real 11m51.809s
server2.log contains:
!!!pgaio_io_reclaim [39023]| ioh: 0x1045ff220, ioh->op: 0, ioh->state: 0, ioh->target: 0, ioh->generation: 5268
!!!pgaio_io_acquire_nb [39023]| ioh: 0x1045ff220, ioh->op: 170, pre_op: 170
!!!AsyncReadBuffers [39023] (1)| blocknum: 67, ioh: 0x1045ff220, ioh->op: 170, ioh->state: 1, ioh->target: 0,
ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 5268
!!!AsyncReadBuffers [39023] before smgrstartreadv| blocknum: 67, ioh: 0x1045ff220, ioh->op: 170, pre_op: 170,
ioh->state: 1, ioh->target: 0, ioh->result: 0, ioh->num_callbacks: 1, ioh->generation: 5268
2025-08-20 20:24:41.767 EEST [39023] LOG: !!!FileStartReadV| ioh: 0x1045ff220, ioh->op: 170, ioh->state: 1,
ioh->target: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 5268
2025-08-20 20:24:41.767 EEST [39023] STATEMENT: SELECT p1.oid, p1.proname
FROM pg_proc as p1 LEFT JOIN pg_description as d
ON p1.tableoid = d.classoid and p1.oid = d.objoid and d.objsubid = 0
WHERE d.classoid IS NULL AND p1.oid <= 9999;
2025-08-20 20:24:41.767 EEST [39023] LOG: !!!pgaio_io_before_start| ioh: 0x1045ff220, ioh->op: 170, ioh->state: 1,
ioh->target: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 5268
2025-08-20 20:24:41.767 EEST [39023] STATEMENT: SELECT p1.oid, p1.proname
FROM pg_proc as p1 LEFT JOIN pg_description as d
ON p1.tableoid = d.classoid and p1.oid = d.objoid and d.objsubid = 0
WHERE d.classoid IS NULL AND p1.oid <= 9999;
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 167, PID: 39023
Best regards,
Alexander
Attachment | Content-Type | Size |
---|---|---|
AIO-debugging.patch | text/x-patch | 6.6 KB |
repro=AIO-Assert.sh.txt | text/plain | 1.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Kirk Wolak | 2025-08-20 18:52:19 | Re: Cygwin support |
Previous Message | Masahiko Sawada | 2025-08-20 17:29:28 | Re: Logical Replication of sequences |