| From: | Bryan Green <dbryan(dot)green(at)gmail(dot)com> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> |
| Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects |
| Date: | 2025-11-03 22:12:36 |
| Message-ID: | a7109b5f-6590-476c-810c-18f1af588238@gmail.com |
| Lists: | pgsql-hackers |
On 11/3/2025 9:29 AM, Andres Freund wrote:
> On 2025-11-03 09:25:11 -0600, Bryan Green wrote:
>> On 11/3/2025 9:19 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2025-11-03 09:12:03 -0600, Bryan Green wrote:
>>>> We just need to call CreateJobObject() in PostmasterMain(), configure
>>>> with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, and assign the postmaster.
>>>> Children inherit membership automatically. When the job handle closes on
>>>> postmaster exit, the kernel terminates all children atomically. This is
>>>> kernel-enforced with no polling and no race conditions.
>>>
>>> What happens if a postmaster child exits irregularly? Is postmaster terminated
>>> as well?
>>>
>>
>> No, Job Objects are unidirectional.
>
> Great.
>
>
>>>> The patch has been tested on Windows 10/11 with both MSVC and MinGW
>>>> builds. Nested jobs fail gracefully as expected. Clean shutdown is
>>>> unaffected. Crash tests with taskkill /F, debugger abort, and access
>>>> violations all correctly terminate children immediately with zero orphans.
>>>>
>>>> This patch does not include automated tests because the core
>>>> functionality (orphan prevention on crash) requires simulating process
>>>> termination, which is difficult to test reliably in CI.
>>>
>>> Why is it difficult to test in CI? We do some related tests in
>>> 013_crash_restart.pl, it doesn't seem like it ought to be hard to also add
>>> tests for postmaster?
>>>
>>
>> Fair point. I was hesitant because testing the actual orphan prevention
>> requires killing the postmaster while backends are active, which seemed
>> fragile. But you're right that we already test similar scenarios.
>>
>> I can add a test to 013_crash_restart.pl (or a new Windows-specific test
>> file) that:
>> 1. Starts server with active backend
>> 2. Kills postmaster ungracefully (taskkill /F)
>> 3. Verifies backend process terminates automatically
>> 4. Confirms clean restart
>>
>> Would that be sufficient, or do you have other test scenarios in mind?
>
> That's pretty much what I had in mind.
>
> Greetings,
>
> Andres Freund

I've implemented the test in 013_crash_restart.pl.

The test passes on Windows 10/11 with both MSVC and MinGW builds.
Backends are typically terminated within 100-200 ms of the postmaster
being killed, confirming that the Job Object KILL_ON_JOB_CLOSE mechanism
works as intended.
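
For anyone following the thread without the patch open, the mechanism
can be sketched roughly as below. This is an illustration only, not the
code from the attached patch: the helper name setup_kill_on_close_job()
is hypothetical, error reporting is reduced to a return code, and the
non-Windows branch is a stub so the snippet compiles elsewhere. The
Win32 calls themselves (CreateJobObject, SetInformationJobObject,
AssignProcessToJobObject) are the documented ones.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Hypothetical helper (name and shape are assumptions, not from the patch).
 *
 * Creates an anonymous job object with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE
 * and assigns the calling process to it.  Child processes created afterwards
 * become members of the job by default, so when the last handle to the job
 * goes away (for example because this process died), the kernel terminates
 * them.
 *
 * Returns 0 on success, -1 on failure or on non-Windows platforms.
 */
static int
setup_kill_on_close_job(void)
{
#ifdef _WIN32
	HANDLE		job;
	JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits;

	job = CreateJobObject(NULL, NULL);
	if (job == NULL)
		return -1;

	ZeroMemory(&limits, sizeof(limits));
	limits.BasicLimitInformation.LimitFlags =
		JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;

	if (!SetInformationJobObject(job, JobObjectExtendedLimitInformation,
								 &limits, sizeof(limits)))
	{
		CloseHandle(job);
		return -1;
	}

	/* Can fail, e.g. when already in a job that does not allow nesting. */
	if (!AssignProcessToJobObject(job, GetCurrentProcess()))
	{
		CloseHandle(job);
		return -1;
	}

	/*
	 * The handle is intentionally left open: it must stay open for the
	 * lifetime of the process and is closed by the kernel at process exit.
	 */
	return 0;
#else
	/* Job objects are a Windows facility; nothing to do elsewhere. */
	return -1;
#endif
}
```

A caller would invoke this once early in startup and treat a -1 return
as non-fatal (log and carry on), which matches the "nested jobs fail
gracefully" behaviour described earlier in the thread.
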
Updated patch (v2) attached.
--
Bryan
| Attachment | Content-Type | Size |
|---|---|---|
| v2-0001-Use-Windows-Job-Objects-to-prevent-orphaned-child.patch | text/plain | 14.7 KB |