From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de> |
Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | Re: Non-reproducible AIO failure |
Date: | 2025-05-24 21:00:00 |
Message-ID: | f975aa25-eddd-45ca-90b5-855338ed75d2@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello Thomas,
24.05.2025 14:42, Thomas Munro wrote:
> On Sat, May 24, 2025 at 3:17 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> So it seems that "very low-probability issue in our Mac AIO code" is
>> the most probable description.
> There isn't any macOS-specific AIO code so my first guess would be
> that it might be due to aarch64 weak memory reordering (though Andres
> speculated that itt should all be one backend, huh), if it's not just
> a timing luck thing. Alexander, were the other OSes you tried all on
> x86?
As I wrote off-list before, I had tried x86_64 only, but since then I
tried to reproduce the issue on an aarch64 server with Ubuntu 24.04,
running 10, then 40 instances of t/027_stream_regress.pl in parallel. I've
also multiplied "test: brin ..." line x10. But the issue is still not
reproduced (in 8+ hours).
However, I've managed to get an AIO-related assertion failure on macOS 14.5
(Darwin Kernel Version 23.5.0) quite easily, running 027_stream_regress.pl
the same way:
2 # +++ tap check in src/test/recovery_2 +++
3 # +++ tap check in src/test/recovery_3 +++
4 # +++ tap check in src/test/recovery_4 +++
5 # +++ tap check in src/test/recovery_5 +++
1 # +++ tap check in src/test/recovery_1 +++
7 # +++ tap check in src/test/recovery_7 +++
6 # +++ tap check in src/test/recovery_6 +++
8 # +++ tap check in src/test/recovery_8 +++
1 [16:07:31] t/027_stream_regress.pl ..
1 Dubious, test returned 29 (wstat 7424, 0x1d00)
1 All 1 subtests passed
1 [16:07:31]
1
1 Test Summary Report
1 -------------------
1 t/027_stream_regress.pl (Wstat: 7424 Tests: 1 Failed: 0)
1 Non-zero exit status: 29
1 Parse errors: No plan found in TAP output
1 Files=1, Tests=1, 197 wallclock secs ( 0.01 usr 0.01 sys + 4.47 cusr 5.79 csys = 10.28 CPU)
1 Result: FAIL
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355
0 postgres 0x0000000104f078f4 ExceptionalCondition + 236
1 postgres 0x0000000104c0ebd4 pgaio_io_before_start + 260
2 postgres 0x0000000104c0ea94 pgaio_io_start_readv + 36
3 postgres 0x0000000104c2d4e8 FileStartReadV + 252
4 postgres 0x0000000104c807c8 mdstartreadv + 668
5 postgres 0x0000000104c83db0 smgrstartreadv + 116
6 postgres 0x0000000104c1675c AsyncReadBuffers + 1444
7 postgres 0x0000000104c15868 StartReadBuffersImpl + 1196
8 postgres 0x0000000104c153ac StartReadBuffers + 64
9 postgres 0x0000000104c129b4 read_stream_start_pending_read + 1204
10 postgres 0x0000000104c12018 read_stream_look_ahead + 812
11 postgres 0x0000000104c11cd0 read_stream_next_buffer + 1396
12 postgres 0x0000000104606c38 heap_fetch_next_buffer + 284
13 postgres 0x00000001045f79d4 heapgettup_pagemode + 192
14 postgres 0x00000001045f7fa4 heap_getnextslot + 84
15 postgres 0x000000010497a404 table_scan_getnextslot + 340
16 postgres 0x000000010497a210 SeqNext + 148
17 postgres 0x000000010497a8e4 ExecScanFetch + 772
18 postgres 0x000000010497a4b8 ExecScanExtended + 164
19 postgres 0x0000000104979bf0 ExecSeqScanWithProject + 244
20 postgres 0x0000000104953ce0 ExecProcNode + 68
21 postgres 0x000000010494f148 MultiExecPrivateHash + 64
22 postgres 0x000000010494ed48 MultiExecHash + 100
23 postgres 0x0000000104925a18 MultiExecProcNode + 180
24 postgres 0x0000000104957220 ExecHashJoinImpl + 628
25 postgres 0x00000001049566dc ExecHashJoin + 32
26 postgres 0x0000000104925958 ExecProcNodeFirst + 100
27 postgres 0x000000010491c08c ExecProcNode + 68
28 postgres 0x000000010491692c ExecutePlan + 268
29 postgres 0x000000010491679c standard_ExecutorRun + 504
30 pg_stat_statements.dylib 0x00000001055e132c pgss_ExecutorRun + 212
31 postgres 0x0000000104916580 ExecutorRun + 72
32 postgres 0x000000010491f59c ParallelQueryMain + 508
33 postgres 0x00000001046968d4 ParallelWorkerMain + 1712
34 postgres 0x0000000104b476f4 BackgroundWorkerMain + 824
35 postgres 0x0000000104b4b87c postmaster_child_launch + 492
36 postgres 0x0000000104b56610 StartBackgroundWorker + 416
37 postgres 0x0000000104b51378 maybe_start_bgworkers + 552
38 postgres 0x0000000104b540cc LaunchMissingBackgroundProcesses + 1316
39 postgres 0x0000000104b51650 ServerLoop + 616
40 postgres 0x0000000104b4fef0 PostmasterMain + 6632
41 postgres 0x00000001049bfe20 main + 952
42 dyld 0x00000001849d60e0 start + 2360
2025-05-24 16:07:23.427 EDT [22120:4] LOG: background worker "parallel worker" (PID 32355) was terminated by signal 6:
Abort trap: 6
2025-05-24 16:07:23.427 EDT [22120:5] DETAIL: Failed process was running: select count(*) from join_foo
left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
I'm yet to see the Assert triggered on the buildfarm, but this one looks
interesting too.
(I can share the complete patch + script for such testing, if it can be helpful.)
Best regards,
Alexander Lakhin
Neon (https://neon.tech)
From | Date | Subject | |
---|---|---|---|
Next Message | David E. Wheeler | 2025-05-24 21:39:01 | Re: PATCH: jsonpath string methods: lower, upper, initcap, l/r/btrim, replace, split_part |
Previous Message | Dmitry Dolgov | 2025-05-24 19:20:27 | Re: Automatically sizing the IO worker pool |