| From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
|---|---|
| To: | Michael Paquier <michael(at)paquier(dot)xyz> |
| Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
| Subject: | Re: injection_points: Switch wait/wakeup to use atomics rather than latches |
| Date: | 2026-06-12 07:02:50 |
| Message-ID: | 0C7D1930-F8BD-45D3-980B-841F1DBE1204@yandex-team.ru |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Michael,
> On 3 Jun 2026, at 03:23, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> That's a direction. Only mmap() would not be sufficient, as WIN32 has
> its own non-POSIX idea on the matter with CreateFileMapping() &
> friends.
Right, mmap() alone is not enough. I hacked up a prototype (on top of
your atomics commit) that maps the state portably on both sides: POSIX
mmap() in the backend and a file-backed CreateFileMapping() on WIN32,
and the same in a small standalone client. Perl cannot mmap()
portably, so the mapping is done by a tiny C helper that TAP drives,
rather than from the test script itself.
> I am wondering if we should think harder about an interface that could
> make such things easier for extensions. Or we could have a portable
> layer added to injection_points, as well..
While prototyping I hit a constraint that I think answers this. To arm
a point without SQL it is not enough to reach this module's wait state:
you have to reach the registry of active points, i.e.
ActiveInjectionPoints in injection_point.c. INJECTION_POINT() consults
that array, and the module only supplies the callback once the core has
found the name. So the part that has to be reachable from outside lives
in the core, not in the module.
That splits the problem into two layers rather cleanly:
- core: the active-points array is backed by a file
(injection_points.shm in the data directory). This is the generic
piece - any extension could arm or inspect points out of band, and
it is what makes attach-without-SQL possible. The lock-free
generation protocol that reads the array is unchanged, so external
readers can rely on it too.
- module: injection_points keeps its own file
(injection_points_wait.shm) for the wait/wakeup coordination of the
"wait" action. That part is test-specific and stays out of the
core.
A standalone client (injection_points_state) maps both files the way
the backend does and can attach/detach a point and detect+release a
waiter, with no backend connection and no PGPROC. A TAP test runs the
whole flow without SQL for arming or waking: arm a wait point with the
client, trigger it from a background session, let the client poll the
mapped file until the point is reached (so no sleep, and it behaves the
same on slow and fast animals), disarm it, then wake. SQL is used only
to trigger the point and to observe injection_points_list().
This is a rough prototype to make the discussion concrete, not a
proposal of the final shape. To name few open points:
1. Where the portable mapping helper should live.
1. Keep the create-or-attach-file helper (POSIX + WIN32) local to
injection_point.c, as in the prototype. Simplest, no new public
surface.
2. Factor a small portable "named file-backed shared region" that any
extension could use - an "interface for extensions". More
general, but a larger commitment and more to review.
I have a slight preference for 1 now, growing into 2 only once a
second user actually appears.
2. One file or two. The prototype keeps the core registry and the
module's wait coordination in two files. Folding a wait counter into
the core InjectionPointEntry would make it a single file, but it
pushes test-only wait semantics into the core struct, which I would
rather avoid. Slight preference for two files.
3. The client writes the registry locklessly (no InjectionPointLock), so
it assumes nothing else attaches/detaches the same array
concurrently. That holds for arming points out of band around a
controlled session, but it is not a general concurrent-safe writer.
I think that is acceptable for tests, but I am not sure.
4. The path defaults to injection_points.shm under the data directory,
with an env override (PG_INJECTION_POINTS_FILE) so a point could be
armed before the data dir exists or in single player mode. I have
not added a test for that, so probably it does not work. Yet.
5. The external mirror reads and writes the pg_atomic_uint{32,64} fields
as plain integers. That only holds where those atomics are a bare
value; on a platform without native 64-bit atomics pg_atomic_uint64
falls back to a spinlock-protected struct with a different layout, so
the byte mirror would not match (and the backend's width assertion
would fail to build there in the first place). Even where they are
native, the 64-bit field forces 8-byte alignment that the mirror must
replicate - I got bitten by exactly that on the 32-bit build, where
the entries array otherwise starts 4 bytes too early. So a robust
portable layer, if we want one, is not as trivial as it first looked
to me.
6. Whether the external interface should do more than "wait". The
file-backed registry already lets an outside process attach a point
to any callback (error, notice, ...); the only thing we can observe
back without SQL today is that a wait was reached. A natural
generalization is a per-point hit counter that any process bumps when
it runs the point, readable from outside. That would let a test
assert "initdb really went through this path", or "the postmaster
reached this point before it failed at startup" - cases where there
is no backend to query. I have not built it, and it leans the same
way as 2. (it would put an observation counter into the core entry),
but it seems like the obvious next use-case if we go this route.
WDYT? I am not at all sure the file-backed registry is the right
long-term shape; if you prefer the elog-and-grep route I am happy to drop
this, it was mostly a way to see how invasive the alternative really is.
Thank you!
Best regards, Andrey Borodin.
P.S. Fun observation - it speeds up tests. I swapped the "point reached"
detection in test_misc/010_index_concurrently_upsert.pl (41 cases, each polling
pg_stat_activity every 100ms via a fresh psql) for a 10ms poll of the mapped
file through the client. Median test wall time dropped from 2.2s to 1.6s
(~27%) on my MacBook Air M5.
| Attachment | Content-Type | Size |
|---|---|---|
| v2026-06-12-0001-injection_points-Switch-wait-wakeup-to-r.patch | application/octet-stream | 4.9 KB |
| v2026-06-12-0003-injection_points-attach-and-coordinate-w.patch | application/octet-stream | 34.9 KB |
| v2026-06-12-0002-injection_points-back-the-active-points-.patch | application/octet-stream | 10.1 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Roman Khapov | 2026-06-12 07:08:29 | LDAP timeout options |
| Previous Message | Kyotaro Horiguchi | 2026-06-12 06:59:30 | Re: Report bytes and transactions actually sent downtream |