ssl tests aren't concurrency safe due to get_free_port()

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: ssl tests aren't concurrency safe due to get_free_port()
Date: 2022-10-02 16:49:31
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


See e.g.
2022-10-01 15:15:21.849 UTC [41962][postmaster] LOG: could not bind IPv4 address "": Address already in use
2022-10-01 15:15:21.849 UTC [41962][postmaster] HINT: Is another postmaster already running on port 57003? If not, wait a few seconds and retry.
2022-10-01 15:15:21.849 UTC [41962][postmaster] WARNING: could not create listen socket for ""

I downloaded all test logs and grepping for 57003 shows the problem:

3:# Checking port 57003
4:# Found port 57003
22:# Running: /usr/sbin/slapd -f /tmp/cirrus-ci-build/build/testrun/ldap/001_auth/data/slapd.conf -h ldap://localhost:57002 ldaps://localhost:57003

253:2022-10-01 15:15:25.103 UTC [42574][client backend] [[unknown]][3/1:0] DETAIL: Connection matched pg_hba.conf line 1: "local all all ldap ldapurl="ldaps://localhost:57003/dc=example,dc=net??sub?(uid=$username)" ldaptls=1"

2:# Checking port 57003
3:# Found port 57003
8:Connection string: port=57003 host=/tmp/1k5yhaWLQ1

2:2022-10-01 15:15:20.668 UTC [41740][postmaster] LOG: listening on Unix socket "/tmp/1k5yhaWLQ1/.s.PGSQL.57003"
58:2022-10-01 15:15:21.849 UTC [41962][postmaster] HINT: Is another postmaster already running on port 57003? If not, wait a few seconds and retry.

I.e. we chose the same port for slapd as part of ldap's as for the
postgres instance of ssl's

I don't think get_free_port() has any chance of being reliable as is. It's
fundamentally racy just among concurrently running tests, without even
considering things external to the tests (given it's using the range of ports
auto-assigned for client tcp ports...).

The current code is from 803466b6ffa, which said:
This isn't 100% bulletproof, since
conceivably something else on the machine could grab the port between
the time we check and the time we actually start the server. But that's
a pretty short window, so in practice this should be good enough.

but I've seen this fail a couple times, so I suspect it's unfortunately not
good enough.

I can see a few potential ways of improving the situation:

1) Improve the port we start the search for a unique port from. We use the
following bit right now:

# Tracking of last port value assigned to accelerate free port lookup.
$last_port_assigned = int(rand() * 16384) + 49152;

It *might* be less likely that we hit conflicts if we start the search on a
port based on the pid, rather than rand(). We e.g., could start at
something like (($$ * 16) % 16384 + 49152), giving a decent likelihood that
each test has 16 free ports.

Perhaps also worth to increase the range of ports searched?

2) Use a lockfile containing a pid to protect the choice of a port within a
build directory. Before accepting a port get_free_port() would check if the
a lockfile exists for the port and if so, if the test using it is still
alive. That will protect against racyness between multiple tests inside a
build directory, but won't protect against races between multiple builds
running concurrently on a machine (e.g. on a buildfarm host)

3) We could generate unique port ranges for each test and thus remove the
chance of conflicts within a builddir and substantially reduce the
likelihood of conflicts between build directories (as the conflict would be
between two tests in different build directories, rather than all tests).

This would be easy to do in the meson build, but nontrivial in autoconf

4) Add retries to the tests that need get_free_port(). That seems hard to get
right, because every single restart of postgres (or some other service that
we used get_free_port()) provides the potential for a new conflict.


Andres Freund


Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-10-02 16:53:01 Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]
Previous Message Jonathan S. Katz 2022-10-02 16:36:52 Re: Question: test "aggregates" failed in 32-bit machine