From: | "Burd, Greg" <greg(at)burd(dot)me> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gavin Panella <gavinpanella(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected |
Date: | 2025-08-15 17:58:17 |
Message-ID: | B03D0B7E-0521-4E74-8884-312C1D530D67@burd.me |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
> On Aug 13, 2025, at 8:01 PM, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> On Wed, Aug 13, 2025 at 7:38 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>> Here's a new attempt at that. It picks futexes automatically, unless
>> you export MACOSX_DEPLOYMENT_TARGET=14.3 or lower, and then it picks
>> sysv as before.
Thomas, thanks for digging into this platform specific issue. I've read
your patch and done some research. Thoughts below...
> I compared my hand-rolled atomics logic against FreeBSD's user space
> semaphores[1]. sem_post() and sem_trywait() are about the same, and I
> guess there just aren't many ways to write those. Minor differences:
>
> 1. They use CAS in sem_post() because they want to report EOVERFLOW if
> you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
> just used fetch-and-add. Is that bad? I noticed their alternative
> older version (remove "_new" from [1]) also uses a whole 32 bit
> counter like me, so perhaps isn't so worried about exceeding a narrow
> field of bits, and does it my way.
The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that
the sem_post() function shall fail if the maximum allowable value of the
semaphore would be exceeded [EOVERFLOW]. So, maybe a CAS is the way to
go after all?
> 2. In sem_trywait() they reload the value while our
> pg_atomic_compare_exchange_u32() does that for you, and I also prefer
> to avoid assignment expression syntax if I can.
>
> On the other hand my v1 sem_wait() seemed a little janky and weird in
> comparison to theirs, and I was a bit worried about bugs. I thought
> about rewriting it to look more like that, but then I realised that I
> could use sem_trywait() to implement sem_wait() for the same net
> effect. It's much shorter and sweeter and more intuitively
> understandable, I think?
So if I understand it the way your sem_trywait() works is:
* Try non-blocking acquire with pg_sem_trywait()
* If fails (semaphore = 0): Call pg_futex_wait_u32() to block efficiently
* When woken up: Loop back to try again
This is essentially how sem_trywait() implementations work in Linux IIRC,
they use futexes or similar kernel primitives for efficient blocking.
> I also didn't bother to supply a couple of functions that posix_sema.c
> doesn't use.
I think that's fine for now.
> There's no immediately obvious performance difference, but I wasn't
> expecting there to be: it's used in the slow path of our LWLocks and
> they already have similar user space atomics in front, so you don't
> really avoid entering the kernel with this technique except perhaps
> just occasionally if you wait just as someone posts. This is all just
> about the convenience of bannishing System V cruft. It doesn't seem
> to be any slower though, in simple testing, and CI might even be a bit
> faster based on spot checks of a few runs... but proper statistical
> tools might be needed to see if that is a real phenomenon.
That lines up with my intuition after reading the patch. I don't know
that it is even worth diving too deep in performance testing this.
> Anyone got any better ideas for how to organise the build scripts,
> control the feature, naming etc? I came up with is this, which
> automatically falls back to sysv if it can't find what it needs
> (respecting Apple's deployment target thing following what we did for
> preadv):
>
> meson.build: sema_type = "unnamed_posix+emulation"
> configure template: PREFERRED_SEMAPHORES="UNNAMED_POSIX+EMULATION"
I'm not an expert on Meson, but I do have years of Autoconf work under
my belt. Offhand this seems reasonable, but I'm new to this project.
> The pg_futex.h abstraction layer doesn't have any support for other
> OSes yet because I tried to pare this patch down to the absolute
> minimum to solve this problem, but I did try at least a little bit to
> anticipate that (having removed it from earlier versions from a few
> years ago) when designing the futex contracts. We could trivially add
> at least OpenBSD support to use emulated semaphores there. I never
> quite figured out whether the NetBSD futex API is really available
> outside its Linux syscall emulator, but that might be doable too. And
> futexes might of course turn out to have applications in other
> synchronisation primitives.
>
> Any thoughts on whether this is worth pursuing, and what kinds of
> validation might be useful?
>
> [1] https://github.com/freebsd/freebsd-src/blob/main/lib/libc/gen/sem_new.c
> <v2-0001-Use-futex-based-semaphore-emulation-on-macOS.patch>
Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX
Overall, I like the approach, it addresses a real limitation on macOS and
could be expanded in the future to cover other similar issues on other
platforms.
best.
-greg
[1] https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/functions/sem_post.html
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Geoghegan | 2025-08-15 17:58:19 | Re: index prefetching |
Previous Message | Nathan Bossart | 2025-08-15 17:57:52 | Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture |