Re: greenfly lwlock corruption in REL_14_STABLE and REL_15_STABLE

From: "Greg Burd" <greg(at)burd(dot)me>
To: "Thomas Munro" <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: greenfly lwlock corruption in REL_14_STABLE and REL_15_STABLE
Date: 2025-12-11 17:27:37
Message-ID: 4ccf62c4-48ed-47cb-badc-9ae436d91b39@app.fastmail.com
Lists: pgsql-hackers


On Wed, Dec 10, 2025, at 12:10 AM, Thomas Munro wrote:
> Beginning a week ago, greenfly (RISC-V, Clang 20.1) has failed like
> this in 5 of 8 runs of the pgbench tests on the two oldest branches:

Hey Thomas, thanks for raising this. I should monitor my farm animals more closely. As greenfly is one of them and my login name (gburd) is littered throughout the logs, I suppose I should dive into this.

> TRAP: FailedAssertion("!(oldstate & LW_VAL_EXCLUSIVE)", File:
> "lwlock.c", Line: 1850, PID: 1536294)
> postgres: main: gburd postgres [local] CREATE
> TYPE(ExceptionalCondition+0x72)[0x2ad1326922]
> postgres: main: gburd postgres [local] CREATE
> TYPE(LWLockRelease+0x51e)[0x2ad1634e60]
> postgres: main: gburd postgres [local] CREATE
> TYPE(_bt_first+0x7f8)[0x2ad139c314]
> postgres: main: gburd postgres [local] CREATE
> TYPE(btgettuple+0xca)[0x2ad13996f8]
> postgres: main: gburd postgres [local] CREATE
> TYPE(index_getnext_tid+0x2a)[0x2ad138bd66]
> postgres: main: gburd postgres [local] CREATE
> TYPE(index_getnext_slot+0x24)[0x2ad138bf56]
> postgres: main: gburd postgres [local] CREATE
> TYPE(systable_getnext+0x18)[0x2ad138a97c]
> postgres: main: gburd postgres [local] CREATE
> TYPE(GetNewOidWithIndex+0xfc)[0x2ad13ed284]
> postgres: main: gburd postgres [local] CREATE
> TYPE(EnumValuesCreate+0x58)[0x2ad14090ec]
> postgres: main: gburd postgres [local] CREATE
> TYPE(DefineEnum+0x10a)[0x2ad14bb948]
> postgres: main: gburd postgres [local] CREATE TYPE(+0x3f0336)[0x2ad164a336]
> postgres: main: gburd postgres [local] CREATE
> TYPE(standard_ProcessUtility+0x468)[0x2ad1649560]
> postgres: main: gburd postgres [local] CREATE TYPE(+0x3eec0e)[0x2ad1648c0e]
> postgres: main: gburd postgres [local] CREATE TYPE(+0x3ee418)[0x2ad1648418]
> postgres: main: gburd postgres [local] CREATE
> TYPE(PortalRun+0x160)[0x2ad1647ec8]
> postgres: main: gburd postgres [local] CREATE
> TYPE(PostgresMain+0x1b34)[0x2ad1646000]
> postgres: main: gburd postgres [local] CREATE TYPE(+0x36205a)[0x2ad15bc05a]
> postgres: main: gburd postgres [local] CREATE
> TYPE(ClosePostmasterPorts+0x0)[0x2ad15bb8e0]
> postgres: main: gburd postgres [local] CREATE
> TYPE(PostmasterMain+0x100a)[0x2ad15b92ac]
> postgres: main: gburd postgres [local] CREATE TYPE(+0x2cac90)[0x2ad1524c90]
> /lib/riscv64-linux-gnu/libc.so.6(+0x277cc)[0x3f9caa77cc]
> /lib/riscv64-linux-gnu/libc.so.6(__libc_start_main+0x78)[0x3f9caa7878]
> postgres: main: gburd postgres [local] CREATE TYPE(_start+0x20)[0x2ad1326ac0]
>
> That's:
>
>     if (mode == LW_EXCLUSIVE)
>         oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_EXCLUSIVE);
>     else
>         oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
>
>     /* nobody else can have that kind of lock */
>     Assert(!(oldstate & LW_VAL_EXCLUSIVE));
>
> I will see if I can reproduce it or see something wrong under qemu,
> but that'll take some time to set up...

It'll take me far less time to reproduce than you. :)
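
For anyone following along, here is a minimal standalone sketch of what that assertion checks, written with C11 atomics rather than PostgreSQL's own atomics layer. It is an illustration under assumptions, not the lwlock.c code: the bit values are what I believe these branches use and should be verified against the tree, and the model_* names are hypothetical. One thing it makes visible is that, despite its name, "oldstate" holds the value after the subtraction, since the sub_fetch form returns the new value.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match lwlock.c in these branches; verify against the tree. */
#define LW_VAL_EXCLUSIVE ((uint32_t) 1 << 24)
#define LW_VAL_SHARED    ((uint32_t) 1)

/* Hypothetical names, used only in this sketch. */
typedef enum { MODEL_EXCLUSIVE, MODEL_SHARED } model_mode;

/*
 * Stand-in for pg_atomic_sub_fetch_u32(): subtract atomically and
 * return the NEW value (fetch_sub returns the old one, so adjust).
 */
static uint32_t
model_sub_fetch_u32(_Atomic uint32_t *state, uint32_t sub)
{
    return atomic_fetch_sub(state, sub) - sub;
}

/*
 * Release the lock in the given mode and report whether the
 * post-release check (the Assert in the quoted code) would pass.
 */
static bool
model_release_ok(_Atomic uint32_t *state, model_mode mode)
{
    uint32_t newstate;

    if (mode == MODEL_EXCLUSIVE)
        newstate = model_sub_fetch_u32(state, LW_VAL_EXCLUSIVE);
    else
        newstate = model_sub_fetch_u32(state, LW_VAL_SHARED);

    /* After any release, the exclusive bit must be clear. */
    return !(newstate & LW_VAL_EXCLUSIVE);
}
```

In this model, releasing an exclusively held lock in exclusive mode passes the check, as does releasing one of two shared holders in shared mode. The check fails if the exclusive bit survives the subtraction, for example when a state holding a single shared locker is released in exclusive mode and the 32-bit value wraps. That is only one way to trip it in the model; it says nothing yet about what is actually happening on greenfly.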

> Since the RISC-V GCC animals aren't showing any problem, I wondered if
> this could be related to commits d8ba910b, 1c7cba4, but that was ~30
> days ago, applied to all branches and prevented reordering of
> non-atomic loads, while here I assume we have __sync_fetch_and_sub()
> without a connection to other memory as far as I can see immediately.
> Commits 332693e7, da39714 touched lwlock.c ~15 days ago, but not in a
> way that immediately seems relevant; if there were a relevant flag
> protocol difference in these branches, then why only this system? It
> also passed half a dozen times before the cluster of failures. That
> seems to point back towards codegen problems, but perhaps of a
> different kind. Unless something else is going really wrong, but it's
> hard to imagine that we forgot which lock type we held...
>
> date | branch | commit | assert_failed
> ------------+---------------+---------------------------------+---------------
> 2025-12-09 | REL_15_STABLE | f188bc5 doc: Fix statement a... |
> 2025-12-09 | REL_14_STABLE | 4c4fa53 doc: Fix statement a... | t
> 2025-12-09 | REL_15_STABLE | 52a9588 Doc: fix typo in has... | t
> 2025-12-05 | REL_15_STABLE | b9a02b9 Fix setting next mul... |
> 2025-12-05 | REL_14_STABLE | 4896955 Fix setting next mul... |
> 2025-12-05 | REL_15_STABLE | 7e54eac Show version of node... | t
> 2025-12-03 | REL_15_STABLE | 8cfb174 Set next multixid's ... | t
> 2025-12-03 | REL_14_STABLE | 81416e1 Set next multixid's ... | t
> 2025-12-02 | REL_15_STABLE | 7792bdc Fix amcheck's handli... |
> 2025-12-02 | REL_14_STABLE | fbb4b60 Fix amcheck's handli... |
> 2025-11-29 | REL_15_STABLE | 134a8ee Avoid rewriting data... |
> 2025-11-29 | REL_14_STABLE | 2d5b97b Avoid rewriting data... |
> 2025-11-27 | REL_15_STABLE | f19502f Allow indexscans on ... |
> 2025-11-27 | REL_14_STABLE | 9e77323 Allow indexscans on ... |
> 2025-11-27 | REL_15_STABLE | f9f9283 doc: Fix misleading ... |
> 2025-11-26 | REL_15_STABLE | eb7743e doc: Clarify passphr... |
> 2025-11-26 | REL_14_STABLE | 9a26ff8 doc: Clarify passphr... |
> 2025-11-25 | REL_15_STABLE | da39714 lwlock: Fix, current... |
> 2025-11-25 | REL_14_STABLE | 332693e lwlock: Fix, current... |
> 2025-11-24 | REL_15_STABLE | ea757e8 Fix incorrect IndexO... |
> 2025-11-24 | REL_14_STABLE | ea36c2f Fix incorrect IndexO... |
> 2025-11-22 | REL_15_STABLE | 5516485 jit: Adjust AArch64-... |
> 2025-11-22 | REL_14_STABLE | 035a1f5 jit: Adjust AArch64-... |
> 2025-11-19 | REL_15_STABLE | 7c49407 Print new OldestXID ... |
> 2025-11-19 | REL_14_STABLE | 11cc0f4 Print new OldestXID ... |
> 2025-11-18 | REL_15_STABLE | 9f5a58a Don't allow CTEs to ... |
> 2025-11-18 | REL_14_STABLE | b853974 Don't allow CTEs to ... |
> 2025-11-18 | REL_15_STABLE | 3995e4a Define PS_USE_CLOBBE... |
> 2025-11-18 | REL_14_STABLE | 29a3e22 Define PS_USE_CLOBBE... |
> 2025-11-17 | REL_15_STABLE | ad5cc3a Update .abi-complian... |
> 2025-11-16 | REL_15_STABLE | 5d5b05c Doc: include MERGE i... |
> 2025-11-14 | REL_15_STABLE | d61af52 Add note about Creat... |
> 2025-11-14 | REL_14_STABLE | 4c179cc Add note about Creat... |
> 2025-11-13 | REL_15_STABLE | c663152 doc: Improve descrip... |
> 2025-11-13 | REL_14_STABLE | 7aa83ea doc: Improve descrip... |
> 2025-11-12 | REL_15_STABLE | 21a9014 Clear 'xid' in dummy... |
> 2025-11-12 | REL_14_STABLE | 84f1bf4 Clear 'xid' in dummy... |
> 2025-11-12 | REL_14_STABLE | 4ef048f doc: Document effect... |
> 2025-11-12 | REL_15_STABLE | 608566b doc: Document effect... |
> 2025-11-12 | REL_14_STABLE | f8a0ea8 Fix range for commit... |
> 2025-11-12 | REL_15_STABLE | 97cd4b6 Fix pg_upgrade aroun... |
> 2025-11-12 | REL_15_STABLE | 74b26c8 doc: Fix incorrect s... |
> 2025-11-11 | REL_15_STABLE | 32f3881 Stamp 15.15.... |
> 2025-11-11 | REL_14_STABLE | 9ad034b Stamp 14.20.... |
> 2025-11-10 | REL_15_STABLE | 70d03b5 Last-minute updates ... |
> 2025-11-10 | REL_14_STABLE | ee953cd Last-minute updates ... |
> 2025-11-10 | REL_15_STABLE | 9142156 libpq: Prevent some ... |
> 2025-11-10 | REL_14_STABLE | e792be6 Translation updates... |
> 2025-11-09 | REL_15_STABLE | e334e80 Release notes for 18... |
> 2025-11-09 | REL_14_STABLE | 06827c5 Release notes for 18... |
> 2025-11-08 | REL_15_STABLE | 1c7cba4 Fix generic read and... |
> 2025-11-08 | REL_14_STABLE | d8ba910 Fix generic read and... |

I'll see what I can do to find the offending commit(s).

best.

-greg
