Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Greg Burd <greg(at)burd(dot)me>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gavin Panella <gavinpanella(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected
Date: 2025-08-16 14:00:09
Message-ID: 64F044AB-08FD-46E6-80FA-C912F42E77D2@burd.me
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


> On Aug 15, 2025, at 10:25 PM, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> On Sat, Aug 16, 2025 at 5:58 AM Burd, Greg <greg(at)burd(dot)me> wrote:
>>> 1. They use CAS in sem_post() because they want to report EOVERFLOW if
>>> you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
>>> just used fetch-and-add. Is that bad? I noticed their alternative
>>> older version (remove "_new" from [1]) also uses a whole 32 bit
>>> counter like me, so perhaps isn't so worried about exceeding a narrow
>>> field of bits, and does it my way.
>>
>> The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that
>> the sem_post() function shall fail if the maximum allowable value of the
>> semaphore would be exceeded [EOVERFLOW]. So, maybe a CAS is the way to
>> go after all?
>
> Hmm. Before last year, POSIX *didn't* specify EOVERFLOW. I was
> looking at one of the earlier versions. But I see now that POSIX 2024
> has just added it! So the question is: do we care?

I’m on the fence about "caring", if your code fixes the issue we could
simply add a comment about eventually needing to support EOVERFLOW. I
don't think what you have in the patch is wrong and we're solving our
issue, not the worlds inconsistencies.

> Supposing posix_sema.c checked that the maximum number of backends
> didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose
> today if you later exceed it in sem_post() you'll get either FATAL:
> EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely
> including a hang due to a corrupted counter, I guess). I don't know
> if systems with a low enough value exist in the wild, but if this
> patch were to define its own PG_SEM_VALUE_MAX and then redefine
> SEM_VALUE_MAX to that, it certainly wouldn't be one of them. It would
> naturally use UINT32_MAX, and then that check would be tautological
> given the types involved, not to mention our own lower limits. So
> perhaps it's not worth more than a comment?
>
> I've found only very brief discussions in the standards community so far:
>
> https://austingroupbugs.net/view.php?id=315
> https://austin-group-l.opengroup.narkive.com/8CNaq5S0/sem-post-overflow

I see what you mean, a comment that captures this decision and we're
done IMO.

>
>> So if I understand it the way your sem_trywait() works is:
>> * Try non-blocking acquire with pg_sem_trywait()
>> * If fails (semaphore = 0): Call pg_futex_wait_u32() to block efficiently
>> * When woken up: Loop back to try again
>>
>> This is essentially how sem_trywait() implementations work in Linux IIRC,
>> they use futexes or similar kernel primitives for efficient blocking.
>
> Yeah. Linux gave us the name futex and seems to have established a
> sort of de facto standard semantics for it, as seen here in macOS.

Indeed, the "fast userspace mutex". *sigh* ;-P

I was calling this out because an implementation of sem_trywait() that
used sem_wait() that wasn't based on futex would be less correct AFAICT.

> It's not the only way to skin that cat... IIUC there were various
> older systems with the double-check to avoid the race against user
> space, and "address as wait channel" wasn't new either, but I think at
> least some of the other things like that might have required a
> waitlist in user space; I'm not too sure about that history though, a
> lot of it was probably closed source. It's not even too clear to me
> how that waitlist tradeoff works out, but in any case, that is exactly
> how a fallback implementation would probably work if we ever decided
> we wanted pg_futex everywhere: stick a ConditionVariable (or similar)
> into the struct, or if you want to preserve sizeof(pg_futex_u32) ==
> sizeof(uint32) and maybe even do away with the struct and just use
> pg_atomic_uint32 directly then you could hash the address to select a
> cv from a table of (say) 256 of them, with the annoying detail that
> you'd probably need a separate wake/wait function pair for {DSM
> segment handle, offset} for memory mapped at different addresses. But
> we don't need to make decisions about any of that vapourware to solve
> this immediate single-platform problem.

>> Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX
>
> We'd be checking if a uint32 is > UINT32_MAX, assuming again we that
> we'd decided to define our own PG_SEM_VALUE_MAX and not the (entirely
> unrelated) system SEM_VALUE_MAX. Might even get a compiler warning if
> you tried it.

Oops, yep. You are right. :)

>> Overall, I like the approach, it addresses a real limitation on macOS and
>> could be expanded in the future to cover other similar issues on other
>> platforms.
>
> Thanks for looking! And especially for the sanity check on the
> definitions of sem_wait()/sem_trywait().

Sure thing, anytime. Not sure I helped much...

-greg

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2025-08-16 14:22:50 Re: Retail DDL
Previous Message Robert Treat 2025-08-16 13:41:57 Re: Adding REPACK [concurrently]