pg_ctl start may return 0 even if the postmaster has been already started on Windows

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'PostgreSQL Hackers' <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Date: 2023-09-07 07:07:36
Message-ID: TYAPR01MB586654E2D74B838021BE77CAF5EEA@TYAPR01MB5866.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear hackers,

While investigating the cfbot failure [1], I found a strange behavior of pg_ctl
command. How do you think? Is this a bug to be fixed or in the specification?

# Problem

The "pg_ctl start" command returns 0 (succeeded) even if the cluster has
already been started. This occurs on Windows environment, and when the command
is executed just after postmaster starts.

# Analysis

The primal reason is in wait_for_postmaster_start(). In this function the
postmaster.pid file is read and checked whether the start command is
successfully done or not.

Check (1) requires that the postmaster must be started after the our pg_ctl
command, but 2 seconds delay is accepted.

In the linux mode, the check (2) is also executed to ensures that the forked
process modified the file, so this time window is not so problematic.
But in the windows system, (2) is ignored, *so the pg_ctl command may be
succeeded if the postmaster is started within 2 seconds.*

```
if ((optlines = readfile(pid_file, &numlines)) != NULL &&
numlines >= LOCK_FILE_LINE_PM_STATUS)
{
/* File is complete enough for us, parse it */
pid_t pmpid;
time_t pmstart;

/*
* Make sanity checks. If it's for the wrong PID, or the recorded
* start time is before pg_ctl started, then either we are looking
* at the wrong data directory, or this is a pre-existing pidfile
* that hasn't (yet?) been overwritten by our child postmaster.
* Allow 2 seconds slop for possible cross-process clock skew.
*/
pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
if (pmstart >= start_time - 2 && // (1)
#ifndef WIN32
pmpid == pm_pid // (2)
#else
/* Windows can only reject standalone-backend PIDs */
pmpid > 0
#endif

```

# Appendix - how do I found?

I found it while investigating the failure. In the test "pg_upgrade --check"
is executed just after old cluster has been started. I checked the output file [2]
and found that the banner says "Performing Consistency Checks", which meant that
the parameter live_check was set to false (see output_check_banner()). This
parameter is set to true when the postmaster has been started at that time and
the pg_ctl start fails. That's how I find.

[1]: https://cirrus-ci.com/task/4634769732927488
[2]: https://api.cirrus-ci.com/v1/artifact/task/4634769732927488/testrun/build/testrun/pg_upgrade/003_logical_replication_slots/data/t_003_logical_replication_slots_new_publisher_data/pgdata/pg_upgrade_output.d/20230905T080645.548/log/pg_upgrade_internal.log

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Yugo NAGATA 2023-09-07 07:08:10 Re: psql help message contains excessive indentations
Previous Message bt23nguyent 2023-09-07 07:06:40 Re: Transaction timeout