Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: "Burd, Greg" <greg(at)burd(dot)me>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gavin Panella <gavinpanella(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected
Date: 2025-08-16 02:25:59
Message-ID: CA+hUKG+4XVv7xinPA2h7MNCnaTpD+0DRW_rCjE-AdPwW3OGd1Q@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Aug 16, 2025 at 5:58 AM Burd, Greg <greg(at)burd(dot)me> wrote:
> > 1. They use CAS in sem_post() because they want to report EOVERFLOW if
> > you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
> > just used fetch-and-add. Is that bad? I noticed their alternative
> > older version (remove "_new" from [1]) also uses a whole 32 bit
> > counter like me, so perhaps isn't so worried about exceeding a narrow
> > field of bits, and does it my way.
>
> The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that
> the sem_post() function shall fail if the maximum allowable value of the
> semaphore would be exceeded [EOVERFLOW]. So, maybe a CAS is the way to
> go after all?

Hmm. Before last year, POSIX *didn't* specify EOVERFLOW. I was
looking at one of the earlier versions. But I see now that POSIX 2024
has just added it! So the question is: do we care?

Supposing posix_sema.c checked that the maximum number of backends
didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose
today if you later exceed it in sem_post() you'll get either FATAL:
EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely
including a hang due to a corrupted counter, I guess). I don't know
if systems with a low enough value exist in the wild, but if this
patch were to define its own PG_SEM_VALUE_MAX and then redefine
SEM_VALUE_MAX to that, it certainly wouldn't be one of them. It would
naturally use UINT32_MAX, and then that check would be tautological
given the types involved, not to mention our own lower limits. So
perhaps it's not worth more than a comment?

I've found only very brief discussions in the standards community so far:

https://austingroupbugs.net/view.php?id=315
https://austin-group-l.opengroup.narkive.com/8CNaq5S0/sem-post-overflow

> So if I understand it the way your sem_trywait() works is:
> * Try non-blocking acquire with pg_sem_trywait()
> * If fails (semaphore = 0): Call pg_futex_wait_u32() to block efficiently
> * When woken up: Loop back to try again
>
> This is essentially how sem_trywait() implementations work in Linux IIRC,
> they use futexes or similar kernel primitives for efficient blocking.

Yeah. Linux gave us the name futex and seems to have established a
sort of de facto standard semantics for it, as seen here in macOS.

It's not the only way to skin that cat... IIUC there were various
older systems with the double-check to avoid the race against user
space, and "address as wait channel" wasn't new either, but I think at
least some of the other things like that might have required a
waitlist in user space; I'm not too sure about that history though, a
lot of it was probably closed source. It's not even too clear to me
how that waitlist tradeoff works out, but in any case, that is exactly
how a fallback implementation would probably work if we ever decided
we wanted pg_futex everywhere: stick a ConditionVariable (or similar)
into the struct, or if you want to preserve sizeof(pg_futex_u32) ==
sizeof(uint32) and maybe even do away with the struct and just use
pg_atomic_uint32 directly then you could hash the address to select a
cv from a table of (say) 256 of them, with the annoying detail that
you'd probably need a separate wake/wait function pair for {DSM
segment handle, offset} for memory mapped at different addresses. But
we don't need to make decisions about any of that vapourware to solve
this immediate single-platform problem.

> Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX

We'd be checking if a uint32 is > UINT32_MAX, assuming again we that
we'd decided to define our own PG_SEM_VALUE_MAX and not the (entirely
unrelated) system SEM_VALUE_MAX. Might even get a compiler warning if
you tried it.

> Overall, I like the approach, it addresses a real limitation on macOS and
> could be expanded in the future to cover other similar issues on other
> platforms.

Thanks for looking! And especially for the sanity check on the
definitions of sem_wait()/sem_trywait().

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2025-08-16 03:04:47 Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected
Previous Message Michael Paquier 2025-08-16 00:22:18 Re: Sequence Access Methods, round two