From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, greg(at)burd(dot)me |
Subject: | Re: IO in wrong state on riscv64 |
Date: | 2025-10-12 05:00:00 |
Message-ID: | 60bbebf4-0a80-4e5f-83c0-311c85f6bd5c@gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello Thomas,
12.10.2025 06:35, Thomas Munro wrote:
> On Sun, Oct 12, 2025 at 2:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>> 2025-10-11 11:34:46.793 GMT [1169773:1] PANIC: !!!pgaio_io_wait| ioh->state changed from 0 to 1 at iteration 0
>> # no other iteration number observed
> Can you please disassemble pgaio_io_update_state() and
> pgaio_io_was_recycled()? I wonder if the memory barriers are not
> being generated correctly, causing the state and generation to be
> loaded out of order, or something like that...
Please find those attached (gdb "disass/m pgaio_io_update_state" misses
the start of the function (but it's still disassembled below), so I
decided to share the whole output).
This is from clean master, without any modifications, but with the issue
confirmed for this build:
2025-10-11 20:29:11.724 UTC [1679534:1] [unknown] LOG: connection received: host=[local]
2025-10-11 20:29:11.725 UTC [1679536:1] [unknown] LOG: connection received: host=[local]
2025-10-11 20:29:11.726 UTC [1679538:1] [unknown] LOG: connection received: host=[local]
2025-10-11 20:29:11.724 UTC [1679533:1] [unknown] LOG: connection received: host=[local]
2025-10-11 20:29:11.724 UTC [1679537:1] [unknown] LOG: connection received: host=[local]
2025-10-11 20:29:11.724 UTC [1679535:1] [unknown] LOG: connection received: host=[local]
2025-10-11 20:29:11.729 UTC [1679539:1] [unknown] LOG: connection received: host=[local
...
2025-10-11 20:29:11.778 UTC [1679537:3] [unknown] LOG: connection authorized: user=debian database=regression
application_name=pg_regress/create_schema
2025-10-11 20:29:11.778 UTC [1679533:3] [unknown] LOG: connection authorized: user=debian database=regression
application_name=pg_regress/create_type
2025-10-11 20:29:11.778 UTC [1679539:3] [unknown] LOG: connection authorized: user=debian database=regression
application_name=pg_regress/create_procedure
2025-10-11 20:29:11.778 UTC [1679536:3] [unknown] LOG: connection authorized: user=debian database=regression
application_name=pg_regress/create_misc
2025-10-11 20:29:11.778 UTC [1679538:3] [unknown] LOG: connection authorized: user=debian database=regression
application_name=pg_regress/create_table
2025-10-11 20:29:11.778 UTC [1679535:3] [unknown] LOG: connection authorized: user=debian database=regression
application_name=pg_regress/create_function_c
2025-10-11 20:29:11.790 UTC [1679534:2] [unknown] FATAL: IO in wrong state: 0
2025-10-11 20:29:11.790 UTC [1679534:3] [unknown] BACKTRACE:
postgres: debian regression [local] authentication(+0x4371ec) [0x555565cd61ec]
postgres: debian regression [local] authentication(+0x442bb2) [0x555565ce1bb2]
postgres: debian regression [local] authentication(StartBufferIO+0x66) [0x555565ce1a10]
postgres: debian regression [local] authentication(+0x43df60) [0x555565cdcf60]
postgres: debian regression [local] authentication(StartReadBuffer+0x308) [0x555565cdc8a4]
postgres: debian regression [local] authentication(ReadBufferExtended+0x64) [0x555565cdaace]
...
postgres: debian regression [local] authentication(postmaster_child_launch+0x132) [0x555565c7787c]
postgres: debian regression [local] authentication(+0x3dcd52) [0x555565c7bd52]
postgres: debian regression [local] authentication(InitProcessGlobals+0) [0x555565c79c96]
postgres: debian regression [local] authentication(+0x3194b2) [0x555565bb84b2]
/lib/riscv64-linux-gnu/libc.so.6(+0x2791c) [0x7fffa690891c]
/lib/riscv64-linux-gnu/libc.so.6(__libc_start_main+0x74) [0x7fffa69089c4]
postgres: debian regression [local] authentication(_start+0x20) [0x5555659842c8]
> The previous failure on greenfly was a TIMEOUT in the same test, as if
> a query was hanging.
Yeah, I'll try to reproduce it too...
> On Sun, Oct 12, 2025 at 2:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>> I've managed to reproduce it using qemu-system-riscv64 with Debian trixie
> Huh, that's interesting. What is the host architecture? When I saw
> that error myself and wondered about memory order, I dismissed the
> idea of trying with qemu, figuring that my x86 host's TSO would affect
> the coherency, but thinking again about that... I guess the compiler
> might still reorder during riscv codegen if there is something wrong
> with the barrier support, and even if it doesn't, the binary
> translation to x86 might also feel free to reorder stuff if there are
> no barrier instructions to prevent it? Or maybe that doesn't happen
> but your host is ARM?
I use AMD Ryzen 9 7900X, Ubuntu 24.04 and run the risc machine with:
qemu-system-riscv64 -machine virt -m 1G -smp 8 -cpu rv64 -device virtio-blk-device,drive=hd \
-drive file=.../dqib_riscv64-virt/image.qcow2,if=none,id=hd -device virtio-net-device,netdev=net \
-netdev user,id=net,hostfwd=tcp::22222-:22 -bios /usr/lib/riscv64-linux-gnu/opensbi/generic/fw_jump.elf \
-kernel /usr/lib/u-boot/qemu-riscv64_smode/uboot.elf -object rng-random,filename=/dev/urandom,id=rng \
-device virtio-rng-device,rng=rng -nographic -append "root=LABEL=rootfs console=ttyS0"
I don't know what hardware greenfly uses (CC'ing Greg in case he'd like to
share some info on this), but timings are similar to what I'm seeing:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=greenfly&dt=2025-10-09%2000%3A06%3A03&stg=check
ok 160 - select_parallel 5019 ms
`make check` on mine, with select_parallel repeated:
ok 160 - select_parallel 6042 ms
ok 161 - select_parallel 6089 ms
ok 162 - select_parallel 6116 ms
copperhead and boomslang, which are using real RISC hardware [1], show
better timings:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=copperhead&dt=2025-10-11%2019%3A36%3A33&stg=check
ok 160 - select_parallel 2677 ms
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=boomslang&dt=2025-10-12%2002%3A17%3A53&stg=check
ok 160 - select_parallel 3651 ms
Thank you for looking into this!
[1] https://www.postgresql.org/message-id/3db97903-884f-4b0c-b1cd-d7442e71ea75%40app.fastmail.com
Best regards,
Alexander
Attachment | Content-Type | Size |
---|---|---|
pgaio_io_update_state.asm | text/plain | 24.4 KB |
pgaio_io_was_recycled.asm | text/plain | 1.5 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Hayato Kuroda (Fujitsu) | 2025-10-12 05:59:14 | RE: [PATCH TEST] Fix logical replication setup in subscription test `t/009_matviews.pl` |
Previous Message | Pavel Stehule | 2025-10-12 04:57:56 | Re: proposal: schema variables |