Re: IO in wrong state on riscv64

From: Greg Burd <greg(at)burd(dot)me>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: IO in wrong state on riscv64
Date: 2025-10-12 13:10:27
Message-ID: 3d6eb098-344e-4616-a9ce-21d08ccffd16@app.fastmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On Sun, Oct 12, 2025, at 1:00 AM, Alexander Lakhin wrote:
> Hello Thomas,
>
> 12.10.2025 06:35, Thomas Munro wrote:
>> On Sun, Oct 12, 2025 at 2:00 AM Alexander Lakhin
>> <exclusion(at)gmail(dot)com> wrote:
>>> 2025-10-11 11:34:46.793 GMT [1169773:1] PANIC: !!!pgaio_io_wait|
>>> ioh->state changed from 0 to 1 at iteration 0
>>> # no other iteration number observed
>> Can you please disassemble pgaio_io_update_state() and
>> pgaio_io_was_recycled()? I wonder if the memory barriers are not
>> being generated correctly, causing the state and generation to be
>> loaded out of order, or something like that...
>
> Please find those attached (gdb "disass/m pgaio_io_update_state" misses
> the start of the function (but it's still disassembled below), so I
> decided to share the whole output).
> This is from clean master, without any modifications, but with the issue
> confirmed for this build:
> 2025-10-11 20:29:11.724 UTC [1679534:1] [unknown] LOG:  connection
> received: host=[local]
> 2025-10-11 20:29:11.725 UTC [1679536:1] [unknown] LOG:  connection
> received: host=[local]
> 2025-10-11 20:29:11.726 UTC [1679538:1] [unknown] LOG:  connection
> received: host=[local]
> 2025-10-11 20:29:11.724 UTC [1679533:1] [unknown] LOG:  connection
> received: host=[local]
> 2025-10-11 20:29:11.724 UTC [1679537:1] [unknown] LOG:  connection
> received: host=[local]
> 2025-10-11 20:29:11.724 UTC [1679535:1] [unknown] LOG:  connection
> received: host=[local]
> 2025-10-11 20:29:11.729 UTC [1679539:1] [unknown] LOG:  connection
> received: host=[local
> ...
> 2025-10-11 20:29:11.778 UTC [1679537:3] [unknown] LOG:  connection
> authorized: user=debian database=regression
> application_name=pg_regress/create_schema
> 2025-10-11 20:29:11.778 UTC [1679533:3] [unknown] LOG:  connection
> authorized: user=debian database=regression
> application_name=pg_regress/create_type
> 2025-10-11 20:29:11.778 UTC [1679539:3] [unknown] LOG:  connection
> authorized: user=debian database=regression
> application_name=pg_regress/create_procedure
> 2025-10-11 20:29:11.778 UTC [1679536:3] [unknown] LOG:  connection
> authorized: user=debian database=regression
> application_name=pg_regress/create_misc
> 2025-10-11 20:29:11.778 UTC [1679538:3] [unknown] LOG:  connection
> authorized: user=debian database=regression
> application_name=pg_regress/create_table
> 2025-10-11 20:29:11.778 UTC [1679535:3] [unknown] LOG:  connection
> authorized: user=debian database=regression
> application_name=pg_regress/create_function_c
> 2025-10-11 20:29:11.790 UTC [1679534:2] [unknown] FATAL:  IO in wrong
> state: 0
> 2025-10-11 20:29:11.790 UTC [1679534:3] [unknown] BACKTRACE:
>         postgres: debian regression [local] authentication(+0x4371ec)
> [0x555565cd61ec]
>         postgres: debian regression [local] authentication(+0x442bb2)
> [0x555565ce1bb2]
>         postgres: debian regression [local]
> authentication(StartBufferIO+0x66) [0x555565ce1a10]
>         postgres: debian regression [local] authentication(+0x43df60)
> [0x555565cdcf60]
>         postgres: debian regression [local]
> authentication(StartReadBuffer+0x308) [0x555565cdc8a4]
>         postgres: debian regression [local]
> authentication(ReadBufferExtended+0x64) [0x555565cdaace]
> ...
>         postgres: debian regression [local]
> authentication(postmaster_child_launch+0x132) [0x555565c7787c]
>         postgres: debian regression [local] authentication(+0x3dcd52)
> [0x555565c7bd52]
>         postgres: debian regression [local]
> authentication(InitProcessGlobals+0) [0x555565c79c96]
>         postgres: debian regression [local] authentication(+0x3194b2)
> [0x555565bb84b2]
>         /lib/riscv64-linux-gnu/libc.so.6(+0x2791c) [0x7fffa690891c]
>         /lib/riscv64-linux-gnu/libc.so.6(__libc_start_main+0x74)
> [0x7fffa69089c4]
>         postgres: debian regression [local]
> authentication(_start+0x20) [0x5555659842c8]
>
>> The previous failure on greenfly was a TIMEOUT in the same test, as if
>> a query was hanging.
>
> Yeah, I'll try to reproduce it too...
>
>> On Sun, Oct 12, 2025 at 2:00 AM Alexander Lakhin
>> <exclusion(at)gmail(dot)com> wrote:
>>> I've managed to reproduce it using qemu-system-riscv64 with Debian trixie
>> Huh, that's interesting. What is the host architecture? When I saw
>> that error myself and wondered about memory order, I dismissed the
>> idea of trying with qemu, figuring that my x86 host's TSO would affect
>> the coherency, but thinking again about that... I guess the compiler
>> might still reorder during riscv codegen if there is something wrong
>> with the barrier support, and even if it doesn't, the binary
>> translation to x86 might also feel free to reorder stuff if there are
>> no barrier instructions to prevent it? Or maybe that doesn't happen
>> but your host is ARM?
>
> I use AMD Ryzen 9 7900X, Ubuntu 24.04 and run the risc machine with:
> qemu-system-riscv64 -machine virt -m 1G -smp 8 -cpu rv64 -device
> virtio-blk-device,drive=hd \
> -drive file=.../dqib_riscv64-virt/image.qcow2,if=none,id=hd -device
> virtio-net-device,netdev=net \
> -netdev user,id=net,hostfwd=tcp::22222-:22 -bios
> /usr/lib/riscv64-linux-gnu/opensbi/generic/fw_jump.elf \
> -kernel /usr/lib/u-boot/qemu-riscv64_smode/uboot.elf -object
> rng-random,filename=/dev/urandom,id=rng \
> -device virtio-rng-device,rng=rng -nographic -append
> "root=LABEL=rootfs
> console=ttyS0"
>
> I don't know what hardware greenfly uses (CC'ing Greg in case he'd
> like to share some info on this),

Hello Alexander, thanks for digging into this issue.

The animal "greenfly" is an OrangePi RV2 [1] with 8GB of RAM. It has
the optional 256GB eMMC module and uses that for the root filesystem.
It's running Ubuntu 24.04.3 and I'm compiling with clang 18-1.3. I
think the rest of the details are in the animal registration or in the
logs, but let me know if you have questions or if there are things that
you'd like me to test.

> but timings are similar to what
> I'm seeing:
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=greenfly&dt=2025-10-09%2000%3A06%3A03&stg=check
> ok 160       - select_parallel                          5019 ms
>
> `make check` on mine, with select_parallel repeated:
> ok 160       - select_parallel                          6042 ms
> ok 161       - select_parallel                          6089 ms
> ok 162       - select_parallel                          6116 ms
>
> copperhead and boomslang, which are using real RISC hardware [1], show
> better timings:
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=copperhead&dt=2025-10-11%2019%3A36%3A33&stg=check
> ok 160       - select_parallel                          2677 ms
>
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=boomslang&dt=2025-10-12%2002%3A17%3A53&stg=check
> ok 160       - select_parallel                          3651 ms
>
> Thank you for looking into this!
>
> [1]
> https://www.postgresql.org/message-id/3db97903-884f-4b0c-b1cd-d7442e71ea75%40app.fastmail.com
>
> Best regards,
> Alexander
> Attachments:
> * pgaio_io_update_state.asm
> * pgaio_io_was_recycled.asm

best.

-greg

[1]
http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-RV2.html

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Banck 2025-10-12 13:20:07 Re: GNU/Hurd portability patches
Previous Message Alexander Lakhin 2025-10-12 13:00:00 Re: GNU/Hurd portability patches