Re: IO in wrong state on riscv64

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, greg(at)burd(dot)me
Subject: Re: IO in wrong state on riscv64
Date: 2025-10-12 05:00:00
Message-ID: 60bbebf4-0a80-4e5f-83c0-311c85f6bd5c@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Thomas,

12.10.2025 06:35, Thomas Munro wrote:
> On Sun, Oct 12, 2025 at 2:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>> 2025-10-11 11:34:46.793 GMT [1169773:1] PANIC: !!!pgaio_io_wait| ioh->state changed from 0 to 1 at iteration 0
>> # no other iteration number observed
> Can you please disassemble pgaio_io_update_state() and
> pgaio_io_was_recycled()? I wonder if the memory barriers are not
> being generated correctly, causing the state and generation to be
> loaded out of order, or something like that...

Please find those attached (gdb "disass/m pgaio_io_update_state" misses
the start of the function (but it's still disassembled below), so I
decided to share the whole output).
This is from clean master, without any modifications, but with the issue
confirmed for this build:
2025-10-11 20:29:11.724 UTC [1679534:1] [unknown] LOG:  connection received: host=[local]
2025-10-11 20:29:11.725 UTC [1679536:1] [unknown] LOG:  connection received: host=[local]
2025-10-11 20:29:11.726 UTC [1679538:1] [unknown] LOG:  connection received: host=[local]
2025-10-11 20:29:11.724 UTC [1679533:1] [unknown] LOG:  connection received: host=[local]
2025-10-11 20:29:11.724 UTC [1679537:1] [unknown] LOG:  connection received: host=[local]
2025-10-11 20:29:11.724 UTC [1679535:1] [unknown] LOG:  connection received: host=[local]
2025-10-11 20:29:11.729 UTC [1679539:1] [unknown] LOG:  connection received: host=[local
...
2025-10-11 20:29:11.778 UTC [1679537:3] [unknown] LOG:  connection authorized: user=debian database=regression
application_name=pg_regress/create_schema
2025-10-11 20:29:11.778 UTC [1679533:3] [unknown] LOG:  connection authorized: user=debian database=regression
application_name=pg_regress/create_type
2025-10-11 20:29:11.778 UTC [1679539:3] [unknown] LOG:  connection authorized: user=debian database=regression
application_name=pg_regress/create_procedure
2025-10-11 20:29:11.778 UTC [1679536:3] [unknown] LOG:  connection authorized: user=debian database=regression
application_name=pg_regress/create_misc
2025-10-11 20:29:11.778 UTC [1679538:3] [unknown] LOG:  connection authorized: user=debian database=regression
application_name=pg_regress/create_table
2025-10-11 20:29:11.778 UTC [1679535:3] [unknown] LOG:  connection authorized: user=debian database=regression
application_name=pg_regress/create_function_c
2025-10-11 20:29:11.790 UTC [1679534:2] [unknown] FATAL:  IO in wrong state: 0
2025-10-11 20:29:11.790 UTC [1679534:3] [unknown] BACKTRACE:
        postgres: debian regression [local] authentication(+0x4371ec) [0x555565cd61ec]
        postgres: debian regression [local] authentication(+0x442bb2) [0x555565ce1bb2]
        postgres: debian regression [local] authentication(StartBufferIO+0x66) [0x555565ce1a10]
        postgres: debian regression [local] authentication(+0x43df60) [0x555565cdcf60]
        postgres: debian regression [local] authentication(StartReadBuffer+0x308) [0x555565cdc8a4]
        postgres: debian regression [local] authentication(ReadBufferExtended+0x64) [0x555565cdaace]
...
        postgres: debian regression [local] authentication(postmaster_child_launch+0x132) [0x555565c7787c]
        postgres: debian regression [local] authentication(+0x3dcd52) [0x555565c7bd52]
        postgres: debian regression [local] authentication(InitProcessGlobals+0) [0x555565c79c96]
        postgres: debian regression [local] authentication(+0x3194b2) [0x555565bb84b2]
        /lib/riscv64-linux-gnu/libc.so.6(+0x2791c) [0x7fffa690891c]
        /lib/riscv64-linux-gnu/libc.so.6(__libc_start_main+0x74) [0x7fffa69089c4]
        postgres: debian regression [local] authentication(_start+0x20) [0x5555659842c8]

> The previous failure on greenfly was a TIMEOUT in the same test, as if
> a query was hanging.

Yeah, I'll try to reproduce it too...

> On Sun, Oct 12, 2025 at 2:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>> I've managed to reproduce it using qemu-system-riscv64 with Debian trixie
> Huh, that's interesting. What is the host architecture? When I saw
> that error myself and wondered about memory order, I dismissed the
> idea of trying with qemu, figuring that my x86 host's TSO would affect
> the coherency, but thinking again about that... I guess the compiler
> might still reorder during riscv codegen if there is something wrong
> with the barrier support, and even if it doesn't, the binary
> translation to x86 might also feel free to reorder stuff if there are
> no barrier instructions to prevent it? Or maybe that doesn't happen
> but your host is ARM?

I use AMD Ryzen 9 7900X, Ubuntu 24.04 and run the risc machine with:
qemu-system-riscv64 -machine virt -m 1G -smp 8 -cpu rv64 -device virtio-blk-device,drive=hd \
-drive file=.../dqib_riscv64-virt/image.qcow2,if=none,id=hd -device virtio-net-device,netdev=net \
-netdev user,id=net,hostfwd=tcp::22222-:22 -bios /usr/lib/riscv64-linux-gnu/opensbi/generic/fw_jump.elf \
-kernel /usr/lib/u-boot/qemu-riscv64_smode/uboot.elf -object rng-random,filename=/dev/urandom,id=rng \
-device virtio-rng-device,rng=rng -nographic -append "root=LABEL=rootfs console=ttyS0"

I don't know what hardware greenfly uses (CC'ing Greg in case he'd like to
share some info on this), but timings are similar to what I'm seeing:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=greenfly&dt=2025-10-09%2000%3A06%3A03&stg=check
ok 160       - select_parallel                          5019 ms

`make check` on mine, with select_parallel repeated:
ok 160       - select_parallel                          6042 ms
ok 161       - select_parallel                          6089 ms
ok 162       - select_parallel                          6116 ms

copperhead and boomslang, which are using real RISC hardware [1], show
better timings:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=copperhead&dt=2025-10-11%2019%3A36%3A33&stg=check
ok 160       - select_parallel                          2677 ms

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=boomslang&dt=2025-10-12%2002%3A17%3A53&stg=check
ok 160       - select_parallel                          3651 ms

Thank you for looking into this!

[1] https://www.postgresql.org/message-id/3db97903-884f-4b0c-b1cd-d7442e71ea75%40app.fastmail.com

Best regards,
Alexander

Attachment Content-Type Size
pgaio_io_update_state.asm text/plain 24.4 KB
pgaio_io_was_recycled.asm text/plain 1.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hayato Kuroda (Fujitsu) 2025-10-12 05:59:14 RE: [PATCH TEST] Fix logical replication setup in subscription test `t/009_matviews.pl`
Previous Message Pavel Stehule 2025-10-12 04:57:56 Re: proposal: schema variables