Re: BUG #16990: Random PANIC in qemu user context

From: Paul Guyot <pguyot(at)kallisys(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16990: Random PANIC in qemu user context
Date: 2021-05-02 20:20:39
Message-ID: 86C24765-95F7-464F-9677-B09A396A5F69@kallisys.net
Lists: pgsql-bugs

> Not sure what to tell you, other than "make sure qemu and your
> build toolchain are up-to-date".

In this scenario, I use the PostgreSQL 11.11 binaries compiled by the Raspbian folks, together with the QEMU binary that Ubuntu provides for focal, which happens to be 4.2 (not the latest).

I identified the function involved by using readelf to locate the string constant.

For the record, the C function is here:
https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/lmgr/lwlock.c#L811

The tight read loop is as follows:
32b548: e28d0004 add r0, sp, #4
32b54c: eb000679 bl 32cf38 <perform_spin_delay@@Base>
32b550: e5943004 ldr r3, [r4, #4]
32b554: e3130201 tst r3, #268435456 ; 0x10000000
32b558: 1afffffa bne 32b548 <RememberSimpleDeadLock@@Base+0xc4>

At address 32b550, it does perform a read, honoring the volatile pointer.

I guess the lock is acquired by the same function:
https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/lmgr/lwlock.c#L824

The corresponding code is the following:
32b508: ee070fba mcr 15, 0, r0, cr7, cr10, {5}
32b50c: e1953f9f ldrex r3, [r5]
32b510: e3832201 orr r2, r3, #268435456 ; 0x10000000
32b514: e1851f92 strex r1, r2, [r5]
32b518: e3510000 cmp r1, #0
32b51c: 1afffffa bne 32b50c <RememberSimpleDeadLock@@Base+0x88>
32b520: e3130201 tst r3, #268435456 ; 0x10000000
32b524: ee070fba mcr 15, 0, r0, cr7, cr10, {5}
32b528: 0a00000e beq 32b568 <RememberSimpleDeadLock@@Base+0xe4>

mcr 15, 0, r0, cr7, cr10, {5} is __sync_synchronize() and based on the previous instructions, r5 is equal to r4+4 as used in the tight loop.

I also guess the corresponding unlock function just follows, and disassembling it reveals the same use of __sync_synchronize():
32b644: ee070fba mcr 15, 0, r0, cr7, cr10, {5}
32b648: e1932f9f ldrex r2, [r3]
32b64c: e3c22201 bic r2, r2, #268435456 ; 0x10000000
32b650: e1831f92 strex r1, r2, [r3]
32b654: e3510000 cmp r1, #0
32b658: 1afffffa bne 32b648 <RememberSimpleDeadLock@@Base+0x1c4>
32b65c: ee070fba mcr 15, 0, r0, cr7, cr10, {5}
32b660: e8bd8070 pop {r4, r5, r6, pc}
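
For readers following the disassembly, here is a minimal C sketch of what the lock and unlock sequences above appear to implement: an atomic OR that sets bit 0x10000000, a plain read loop while the bit is set, and an atomic AND that clears it. The function names and the empty delay helper are hypothetical, and the GCC __sync builtins are used here only as a stand-in for whatever atomics wrappers PostgreSQL compiles down to on this target. It is not the actual lwlock.c code.

```c
#include <assert.h>
#include <stdint.h>

/* The bit tested and set in the disassembly: 0x10000000 (bit 28). */
#define WAIT_LIST_LOCKED ((uint32_t) 1 << 28)

/* Hypothetical stand-in for perform_spin_delay(); it does nothing here. */
static void spin_delay_sketch(void)
{
}

/* Acquire: atomically set the flag; if it was already set, spin and retry. */
static void wait_list_lock_sketch(volatile uint32_t *state)
{
    for (;;)
    {
        /* Full-barrier read-modify-write (ldrex/orr/strex in the listing). */
        uint32_t old = __sync_fetch_and_or(state, WAIT_LIST_LOCKED);

        if (!(old & WAIT_LIST_LOCKED))
            return;             /* the flag was clear, so we now hold it */

        /* Tight read loop: plain loads through the volatile pointer. */
        do
        {
            spin_delay_sketch();
            old = *state;
        } while (old & WAIT_LIST_LOCKED);
    }
}

/* Release: atomically clear the flag and return the previous state word. */
static uint32_t wait_list_unlock_sketch(volatile uint32_t *state)
{
    /* Full-barrier read-modify-write (ldrex/bic/strex in the listing). */
    uint32_t old = __sync_fetch_and_and(state, ~WAIT_LIST_LOCKED);

    assert(old & WAIT_LIST_LOCKED);     /* we must have been holding it */
    return old;
}
```

The correctness of this pattern depends entirely on the read-modify-write being atomic, which is the property the emulator has to preserve for ldrex/strex.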

QEMU user emulation documentation mentions something specific to threading on ARM.
https://qemu.readthedocs.io/en/latest/user/main.html
> Threading:
> On Linux, QEMU can emulate the clone syscall and create a real host thread (with a separate virtual CPU) for each emulated thread. Note that not all targets currently emulate atomic operations correctly. x86 and Arm use a global lock in order to preserve their semantics.

I have yet to determine what impact it could have here. Can we imagine a situation where the memory barrier was not honored and an unlock would be overwritten with a lock?
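
To make that question concrete, the following C sketch replays one hypothetical interleaving in a single thread, assuming the load and store halves of the atomic OR were not kept atomic with respect to a concurrent unlock. It only illustrates the scenario being asked about; it is not a claim about what qemu 4.2 actually does, and the function name and the "backend A/B" labels are made up for the example.

```c
#include <stdint.h>

/* The bit used by the lock and unlock sequences: 0x10000000 (bit 28). */
#define WAIT_LIST_LOCKED ((uint32_t) 1 << 28)

/*
 * Replay of a hypothetical broken interleaving between an unlock (backend A)
 * and a lock attempt (backend B). Returns the final state word and reports
 * through *b_owns_lock whether B believes it acquired the lock.
 */
static uint32_t replay_lost_unlock(int *b_owns_lock)
{
    uint32_t state = WAIT_LIST_LOCKED;      /* backend A holds the lock */

    /* Backend B: load half of its fetch-or (the ldrex). */
    uint32_t b_loaded = state;

    /* Backend A: complete unlock slips in between B's load and store. */
    state &= ~WAIT_LIST_LOCKED;

    /*
     * Backend B: store half (the strex), assumed here to succeed even
     * though the word changed. It writes a value computed from stale data.
     */
    state = b_loaded | WAIT_LIST_LOCKED;

    /* B saw the flag already set in its stale copy, so it thinks it lost. */
    *b_owns_lock = !(b_loaded & WAIT_LIST_LOCKED);

    return state;
}
```

In this replay the flag ends up set while neither backend believes it holds the lock, so any waiter in the tight read loop would spin with nobody left to clear the bit.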

Finally, I tried running the whole script under taskset -c 0 (which is fine for the tests, as the target system, a Raspberry Pi Zero, is single-core, while GitHub Linux runners have 2 vCPUs).
https://github.com/pguyot/pynab/commit/91011e68e446c69e317fd1198c58f85ff0cd5fb1
https://github.com/pguyot/pynab/runs/2486051700?check_suite_focus=true

I have run it four times so far, and no PostgreSQL PANIC has occurred. So your hypothesis of a bug (or limitation) in qemu 4.2 seems probable…
FYI, newer ARM architectures, starting with ARMv7, have a dedicated memory barrier instruction, which is not used here because the Raspberry Pi Zero CPU does not recognize it.

Paul
