Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows

From: Bryan Green <dbryan(dot)green(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Date: 2025-10-24 06:02:47
Message-ID: 07d00685-26b1-461d-99bf-7e973236e875@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10/23/2025 11:02 PM, Bryan Green wrote:
> On 3/26/2025 3:18 AM, vignesh C wrote:
>> On Sat, 20 Jul 2024 at 00:03, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>
>>> On Tue, Jul 16, 2024 at 8:04 AM Yasir <yasir(dot)hussain(dot)shah(at)gmail(dot)com>
>>> wrote:
>>>> Please ignore the above 4 lines in my review. See my comments in blue.
>>>
>>> OK, so I think it's unclear what the next steps are for this patch.
>>>
>>> 1. On June 3rd, Michael Paquier said that Tom Lane proposed that,
>>> after doing what the patch currently does, we could simplify some
>>> other stuff. The details are unclear, and Tom hasn't commented.
>>>
>>> 2. On June 29th, Noah Misch proposed a platform-independent way of
>>> solving the problem.
>>>
>>> 3. On July 12th, Sutou Kouhei proposed using CreateProcess() to start
>>> the postmaster instead of cmd.exe.
>>>
>>> 4. On July 16th, Yasir Shah said that he tested the patch and found
>>> that the problem only exists in v17, not any prior release, which is
>>> contrary to my understanding of the situation. He also proposed a
>>> minor tweak to the patch itself.
>>>
>>> So, as I see it, we have three possible ways forward here. First, we
>>> could stick with the current patch, possibly with further work as per
>>> [1] or adjustments as per [4]. Second, we could abandon the current
>>> approach and adopt Noah's proposal in [2]. Third, we could possibly
>>> abandon the current approach and adopt Sutou's proposal in [3]. I say
>>> "possibly" because I can't personally assess whether this approach is
>>> feasible.
>>
>> Thank you very much, Robert, for summarizing this. If anyone has
>> suggestions on which approach might work best, please share them to
>> help move this discussion forward.
>>
>> Regards,
>> Vignesh
>>
>>
> Hello everyone,
>
> I've spent the last few days implementing Sutou-san's CreateProcess()
> proposal as a proof-of-concept. With all three approaches now available
> for comparison, I wanted to share my assessment and recommendation.
>
> The core issue, as we've established, is that cmd.exe intermediation
> prevents pg_ctl from getting the real postgres.exe PID, creating race
> conditions where we can't distinguish a newly-started postmaster from a
> pre-existing one. This causes pg_ctl start to incorrectly report success
> when attempting to start an already-running cluster.
>
> Horiguchi-san's process tree walking patch works and is ready to ship.
> It has real merit as a minimal-change solution. However, it keeps
> cmd.exe and adds process enumeration complexity to work around that
> decision. We're fixing symptoms rather than the root cause.
>
> Noah's token proposal is architecturally elegant and platform-
> independent. My concern is that it solves a Windows-specific problem by
> adding complexity to all platforms. Making other platforms adopt token-
> passing infrastructure for Windows's problems feels wrong. It also
> requires postmaster changes for what is fundamentally a pg_ctl issue,
> raising compatibility questions.
>
> The CreateProcess() approach eliminates cmd.exe entirely and gives us
> the actual postgres.exe PID directly. I've implemented this in the
> attached patch (about 250 lines of Windows-specific code). The
> implementation uses CreateProcessAsUser() with a restricted security
> token to handle the case where pg_ctl runs with elevated privileges—this
> drops Administrator group membership and dangerous privileges before
> launching the postmaster, matching the behavior of the existing
> CreateRestrictedProcess() function.
>
> For I/O redirection, we use Windows's native handle-based APIs. When a
> log file is specified, we open it with CreateFile() and pass it through
> STARTUPINFO. When there's no log file (interactive use and postgres -C
> queries), we inherit standard handles. One detail worth noting: we wait
> 2 seconds for no-log-file launches to distinguish postgres -C (which
> exits immediately) from actual server startup. This is long enough to
> catch quick exits even on loaded CI systems without impacting test suite
> performance.
>
> The implementation passes all regression tests on Windows, including CI
> environments with elevated privileges. It handles all the problem
> scenarios correctly: normal startup, detecting already-running clusters,
> postgres -C queries, pg_upgrade with multiple instances, and concurrent
> start attempts.
>
> I prefer the CreateProcess() approach. It fixes the root cause, not the
> symptoms. We get the real PID immediately with no process tree walking
> or PID verification complexity. It's a Windows-specific solution to a
> Windows-specific problem.
>
> If the CreateProcess() approach is deemed unacceptable, I would support
> Horiguchi-san's process tree walking patch as a pragmatic fallback. It
> fixes the bug and is tested. However, I believe we'd be choosing a
> workaround over a proper fix and adding maintenance burden we don't need.
>
> I'm opposed to the token approach for the reasons stated above—it's
> architecturally appealing but touches all platforms for a Windows-only
> problem.
>
> With this implementation complete, we now have all three options on the
> table for evaluation. I believe the CreateProcess() approach is
> technically sound and the right path forward, but I'm open to discussion
> and happy to address any concerns about the implementation.
>
> Regardless of which way we choose to go, I'll happily help review and
> test the solution.
>
> Best regards,
> BG

I realized the patch had an issue after sending. Here is the updated
patch. Again, this patch is just for an example of Sutou-san's suggestion.

Attachment Content-Type Size
v2-0001-Fix-pg_ctl-on-Windows-to-reliably-detect-already-run.patch text/plain 21.1 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2025-10-24 06:03:27 Re: Avoid resource leak (src/test/regress/pg_regress.c)
Previous Message Michael Paquier 2025-10-24 05:57:40 Re: Avoid handle leak (src/bin/pg_ctl/pg_ctl.c)