Re: Heads Up: cirrus-ci is shutting down June 1st

From: Andres Freund <andres(at)anarazel(dot)de>
To: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Cc: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Heads Up: cirrus-ci is shutting down June 1st
Date: 2026-05-27 18:10:46
Message-ID: qe4lh2i5di2gh7bxkbfisifaohrvyfukbybwxwzxdnll45hnt3@luod7i2mon67
Lists: pgsql-hackers

Hi,

> Here is the v2, I took Jelte's patch and reviewed & merged it with my
> patch. Updates and questions are:
>
> 1- I continued to use Jelte's container method (Linux tasks only for
> now, BSD tasks will be included in the future) because I think that is
> the future-proof way since we might want to generate our container
> images in the future. Also, up-to-date Debian images can be tested
> with this way; otherwise we would need to use Ubuntu 24.04.

Good.

> 2- io_uring tests work on the Linux Meson task.

Is there a reason to not just do that for all the tasks?

> 3- I didn't put commands to helper scripts for now. I think it is a
> good thing to have a helper script but it would be better to have this
> helper script after the first version is committed since it can extend
> the timeline. Also, I found that having all commands in one file makes
> debugging easier.

Hm. I'm a bit worried about this getting pretty unmaintainable, due to the
repetition. I think at least we need to use yaml anchors to deduplicate some
steps.

> 4- FreeBSD task has these options:
>
> PG_TEST_INITDB_EXTRA_OPTS: >-
> -c debug_copy_parse_plan_trees=on
> -c debug_write_read_parse_plan_trees=on
> -c debug_raw_expression_coverage_test=on
> -c debug_parallel_query=regress
>
> Since we won't have FreeBSD for the first version. I put these options
> to the MacOS task but I couldn't decide where to put
> 'PG_TEST_PG_UPGRADE_MODE: --link'.

Makes sense.

> Also, I am planning to work on back patches when we agree on the
> upstream one. Does that sound good?

Yep.

> diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
> new file mode 100644
> index 00000000000..6d20068727c
> --- /dev/null
> +++ b/.github/workflows/ci.yml
> @@ -0,0 +1,1125 @@
> +# GitHub Actions CI configuration for PostgreSQL
> +
> +name: Github Actions CI
> +
> +on:
> + push:
> + branches: [ "*" ]
> +
> +# Default to the minimum privilege the jobs need (just reading the repo
> +# contents during checkout). Individual jobs override this when they need
> +# more, e.g. `cancel-previous` needs `actions: write` to cancel runs.
> +permissions:
> + contents: read

I'm not sure I like that we ever need more than that. I'd expect that
postgresql-cfbot will explicitly disable write permissions for runs.

> +# NB: intentionally NO workflow-level `concurrency:` block. The native
> +# concurrency mechanism makes a new run wait for the previous one to fully
> +# cancel before it starts — which can take a while. Instead the
> +# `cancel-previous` job below fires a cancel API call asynchronously,
> +# so the new run gets going immediately. On master the cancel job is skipped,
> +# so every push runs to completion.

Is this really worth having our own code? Seems like it'd not be that frequent
to push if there are already running runs? What kind of delays are we talking
about?
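
For comparison, a minimal sketch of the native mechanism the comment argues
against. The group name and the master test are assumptions for illustration,
not taken from the patch:

```yaml
# Sketch only: group name and branch test are illustrative assumptions.
concurrency:
  # One group per workflow and ref, so pushes to different branches
  # never cancel each other.
  group: ${{ github.workflow }}-${{ github.ref }}
  # Cancel superseded runs everywhere except master, where every push
  # should run to completion.
  cancel-in-progress: ${{ github.ref != 'refs/heads/master' }}
```

Whether the wait for the previous run's cancellation is long enough to justify
a separate `cancel-previous` job would need to be measured.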

> + # To avoid unnecessarily spinning up a lot of VMs / containers for entirely
> + # broken commits, have a minimal task that all others depend on.
> + #
> + # SPECIAL:
> + # - Builds with --auto-features=disabled and thus almost no enabled
> + # dependencies
> + sanity-check:
> + name: SanityCheck
> + needs: setup
> + if: needs.setup.outputs.sanitycheck == 'true'
> + runs-on: ubuntu-latest
> + timeout-minutes: 15
> + container:
> + image: ${{ needs.setup.outputs.linux_ci_image }}
> + env:
> + BUILD_JOBS: 8
> + TEST_JOBS: 8
> + CCACHE_DIR: ${{ github.workspace }}/ccache_dir
> + # no options enabled, should be small
> + CCACHE_MAXSIZE: "150M"
> + steps:
> + - uses: actions/checkout@v6
> + with:
> + fetch-depth: ${{ env.CLONE_DEPTH }}
> +
> + - name: Restore ccache
> + uses: actions/cache@v5

Seems like this is used by every task. Can we move this into a yaml anchor or
such, by using a variable representing the job name?
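
Roughly what I have in mind, as an untested sketch. CACHE_NAME is a
hypothetical per-job variable, and this assumes plain YAML anchors and aliases
are accepted by the workflow parser (merge keys, `<<:`, may not be):

```yaml
jobs:
  sanity-check:
    runs-on: ubuntu-latest
    env:
      CCACHE_DIR: ${{ github.workspace }}/ccache_dir
      CACHE_NAME: sanitycheck          # hypothetical per-job variable
    steps:
      - &restore_ccache                # anchor defined at first use
        name: Restore ccache
        uses: actions/cache@v5
        with:
          path: ${{ env.CCACHE_DIR }}
          key: ccache-${{ env.CACHE_NAME }}-${{ github.run_id }}
          restore-keys: ccache-${{ env.CACHE_NAME }}-

  linux-autoconf:
    runs-on: ubuntu-latest
    env:
      CCACHE_DIR: /tmp/ccache_dir
      CACHE_NAME: linux-autoconf
    steps:
      - *restore_ccache                # same step, evaluated with this job's env
```

Because the expressions are evaluated per job, the aliased step picks up each
job's own CCACHE_DIR and CACHE_NAME.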

> + with:
> + path: ${{ env.CCACHE_DIR }}
> + key: ccache-sanitycheck-${{ github.run_id }}
> + restore-keys: ccache-sanitycheck-

Why is the key here the run id? Doesn't that mean that we will never have a
precise cache match and that we will keep multiple versions of the cache
around? That seems like a waste of cache space?

For efficiency, particularly on cfbot, it seems like it could be useful to
populate the cache of branches with the cache of the master branch. For that
we'd need the branch name in the key. Which I think would also be good for
postgres/postgres, as we currently have a lot of interference between runs on
the main and the REL_XY_STABLE branches.
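
A sketch of a branch-aware key, under the assumption that master is the
repository's default branch (caches from the default branch are the ones other
branches are allowed to restore). The exact key layout is illustrative:

```yaml
- name: Restore ccache
  uses: actions/cache@v5
  with:
    path: ${{ env.CCACHE_DIR }}
    # Cache entries cannot be overwritten once saved, so some changing
    # suffix is still needed for the cache to be refreshed at all.
    key: ccache-sanitycheck-${{ github.ref_name }}-${{ github.sha }}
    # Prefixes are tried in order: first this branch's most recent
    # cache, then fall back to one populated by master.
    restore-keys: |
      ccache-sanitycheck-${{ github.ref_name }}-
      ccache-sanitycheck-master-
```

That keeps master and REL_XY_STABLE caches apart while still letting a new
branch start from a warm cache.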

> + - name: Prepare workspace
> + run: |
> + whoami
> + useradd -m postgres
> + chown -R postgres:postgres .
> + mkdir -p "$CCACHE_DIR"
> + chown -R postgres:postgres "$CCACHE_DIR"
> + # Can't change the container's kernel.core_pattern; the postgres
> + # user can't write to / normally. Make / writable.
> + chown root:postgres /
> + chmod g+rwx /

Why not just always use a privileged container?

> + - name: Configure
> + run: |
> + su postgres <<-'EOF'
> + set -e
> + meson setup \
> + --buildtype=debug \
> + --auto-features=disabled \
> + -Ddefault_library=shared \
> + -Dtap_tests=enabled \
> + build
> + EOF
> +
> + - name: Build
> + run: |
> + su postgres <<EOF
> + set -e
> + ninja -C build -j${BUILD_JOBS} ${MBUILD_TARGET}
> + EOF

Should we have an explicit cache upload step here? Or are upload steps run
unconditionally?
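
As far as I know, the combined actions/cache action saves in a post-job step
that by default only runs when the job succeeded. If that is right, splitting
restore and save would make the upload explicit. An untested sketch, reusing
the key from the patch:

```yaml
steps:
  - name: Restore ccache
    uses: actions/cache/restore@v5
    with:
      path: ${{ env.CCACHE_DIR }}
      key: ccache-sanitycheck-${{ github.run_id }}
      restore-keys: ccache-sanitycheck-

  # Configure and build steps go here.

  - name: Save ccache
    # Upload even if an earlier step failed, so a failing run still
    # warms the cache for the next one.
    if: always()
    uses: actions/cache/save@v5
    with:
      path: ${{ env.CCACHE_DIR }}
      key: ccache-sanitycheck-${{ github.run_id }}
```

Placing the save step directly after the build means test failures no longer
cost us the compiled objects.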

> + # Run a minimal set of tests. The main regression tests take too long
> + # for this purpose. For now this is a random quick pg_regress style
> + # test, and a tap test that exercises both a frontend binary and the
> + # backend.
> + - name: Test
> + run: |
> + su postgres <<EOF
> + set -e
> + ulimit -c unlimited
> + meson test ${MTEST_ARGS} --suite setup
> + meson test ${MTEST_ARGS} --num-processes ${TEST_JOBS} \
> + cube/regress pg_ctl/001_start_stop
> + EOF
> +
> + - name: Core backtraces
> + if: failure()
> + run: |
> + mkdir -m 770 /tmp/cores
> + find / -maxdepth 1 -type f -name 'core*' -exec mv '{}' /tmp/cores/ \;
> + src/tools/ci/cores_backtrace.sh linux /tmp/cores
> +
> + - name: Upload logs
> + if: failure()
> + uses: actions/upload-artifact@v7
> + with:
> + name: sanitycheck-logs-${{ github.run_id }}
> + path: |
> + build*/testrun/**/*.log
> + build*/testrun/**/*.diffs
> + build*/testrun/**/regress_log_*
> + build*/meson-logs/*.txt
> + if-no-files-found: ignore

I think this really should be in a yaml anchor, we have a few somewhat
different versions of this now.

It's pretty annoying that the output of the failures isn't visible in the UI.
Maybe we ought to print a few of the failures out or something?
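
Something like the following could work, as a sketch. The step name and the
100 line limit are arbitrary choices, and only `*.diffs` files are shown:

```yaml
- name: Show test failures
  if: failure()
  run: |
    # Print the start of every regression diff into the job log, each in
    # its own collapsible group so the log stays readable.
    find build*/testrun -type f -name '*.diffs' 2>/dev/null | while read -r f; do
      echo "::group::$f"
      head -n 100 "$f"
      echo "::endgroup::"
    done
```

The full logs would still be uploaded as artifacts; this only makes the most
common failures visible without downloading anything.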

> +
> + # SPECIAL:
> + # - Uses address sanitizer (sanitizer failures are typically printed in
> + # the server log)
> + # - Configures postgres with a small segment size
> + #
> + # Enable a reasonable set of sanitizers. Use the linux task for that, as
> + # it's one of the fastest tasks (without sanitizers). Also several of the
> + # sanitizers work best on linux.
> + #
> + # The overhead of alignment sanitizer is low, undefined behaviour has
> + # moderate overhead. Test alignment sanitizer in the meson task, as it
> + # does both 32 and 64 bit builds and is thus more likely to expose
> + # alignment bugs.
> + #
> + # Address sanitizer in contrast is somewhat expensive. Enable it in the
> + # autoconf task, as the meson task tests both 32 and 64bit.

I wonder if we should split the meson task into two, one for 32bit and one for
64bit. The concurrency limits for public repos are high enough for that to
seem like a reasonable tradeoff? There's no work, other than the repo
checkout, shared between them.
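
One way to express the split without duplicating the job is a matrix. This is
only a sketch: the `bits` and `cc` matrix keys are made up for illustration,
and a real 32bit build needs more than `-m32` (pkg-config path, perl, etc.):

```yaml
linux-meson:
  name: Linux - Debian Trixie - Meson (${{ matrix.bits }}bit)
  runs-on: ubuntu-latest
  strategy:
    # Let the other width finish even if one of them fails.
    fail-fast: false
    matrix:
      include:
        - bits: 64
          cc: ccache gcc
        - bits: 32
          cc: ccache gcc -m32
  env:
    CC: ${{ matrix.cc }}
```

Each matrix entry runs as its own job, so the two builds proceed in parallel.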

> + # disable_coredump=0, abort_on_error=1: for useful backtraces in case of crashes
> + # print_stacktraces=1,verbosity=2, duh
> + # detect_leaks=0: too many uninteresting leak errors in short-lived binaries
> + linux-autoconf:
> + name: Linux - Debian Trixie - Autoconf
> + needs: [setup, sanity-check]
> + if: |
> + !cancelled() &&
> + needs.setup.outputs.linux == 'true' &&
> + needs.sanity-check.result != 'failure'
> + runs-on: ubuntu-latest
> + timeout-minutes: 60
> + container:
> + image: ${{ needs.setup.outputs.linux_ci_image }}
> + # Share the host PID + IPC namespaces. 017_shm.pl rapidly creates,
> + # kill9's, and restarts postgres; with the container's small PID
> + # space a new postgres can recycle the dead postmaster's PID before
> + # pg_ctl's postmaster.pid check notices, producing spurious "node X
> + # is already running" failures. SysV shm in the test also relies on
> + # host-like IPC behavior.
> + #
> + # --ulimit raises memlock and core dump size. Memlock is needed for
> + # running the AIO tests.
> + #
> + # --privileged is needed so the prepare step can write to sysctls
> + # under /proc/sys (it's mounted read-only without it). We use it to
> + # set kernel.core_pattern.
> + options: --pid=host --ipc=host --ulimit memlock=-1:-1 --privileged
> + env:
> + BUILD_JOBS: 4
> + TEST_JOBS: 8
> + CCACHE_DIR: /tmp/ccache_dir
> + DEBUGINFOD_URLS: "https://debuginfod.debian.net"
> +
> + SANITIZER_FLAGS: -fsanitize=address
> + UBSAN_OPTIONS: print_stacktrace=1:disable_coredump=0:abort_on_error=1:verbosity=2
> + ASAN_OPTIONS: print_stacktrace=1:disable_coredump=0:abort_on_error=1:detect_leaks=0:detect_stack_use_after_return=0
> + CFLAGS: -Og -ggdb -fno-sanitize-recover=all -fsanitize=address
> + CXXFLAGS: -Og -ggdb -fno-sanitize-recover=all -fsanitize=address
> + LDFLAGS: -fsanitize=address
> + CC: ccache gcc
> + CXX: ccache g++

There's a fair bit of stuff shared between the meson/autoconf linux
tasks. Previously they used a matrix to reduce that a *bit*. But now it's
entirely duplicated, including stuff that doesn't apply to the current job
(e.g. UBSAN_OPTIONS/ASAN_OPTIONS). And blocks like the following:

> + - name: Prepare workspace
> + run: |
> + useradd -m postgres
> + chown -R postgres:postgres .
> + mkdir -p "$CCACHE_DIR"
> + chown -R postgres:postgres "$CCACHE_DIR"
> + mkdir -m 770 /tmp/cores
> + chown root:postgres /tmp/cores
> + sysctl kernel.core_pattern='/tmp/cores/%e-%s-%p.core'
> +
> + # Hosts for the load balance test
> + cat >> /etc/hosts <<-EOF
> + 127.0.0.1 pg-loadbalancetest
> + 127.0.0.2 pg-loadbalancetest
> + 127.0.0.3 pg-loadbalancetest
> + EOF

> + # Install dependencies via Homebrew rather than Macports. On stock
> + # GH runners macports requires a heavy bootstrap, and the relevant
> + # Postgres deps are all available in brew.

What does "heavy bootstrap" mean?

> + - name: Install dependencies
> + run: |
> + brew update
> + brew install \
> + ccache meson openldap python@3.12 tcl-tk
> + # IPC::Run via cpanm (system perl)
> + sudo cpan -T -i IPC::Run IO::Tty

We do spend ~95s on this every run, that's not nothing. And doing that every
run puts a bunch of load onto brew's mirrors.

> + - name: Test world
> + run: |
> + ulimit -c unlimited
> + ulimit -n 1024
> + meson test ${MTEST_ARGS} --num-processes ${TEST_JOBS}

I'd re-add the comments that were in .cirrus.tasks.yml about this.

> + windows-vs:
> + name: Windows - Server 2022, VS 2022 - Meson & ninja
> + needs: [setup, sanity-check]
> + if: |
> + !cancelled() &&
> + needs.setup.outputs.windows == 'true' &&
> + needs.sanity-check.result != 'failure'
> + runs-on: windows-2022
> + timeout-minutes: 60
> + env:
> + TEST_JOBS: 8
> + # Avoid port conflicts between concurrent tap tests
> + PG_TEST_USE_UNIX_SOCKETS: 1
> + PG_REGRESS_SOCK_DIR: 'c:\pgsock\'

At least my editor gets confused by the \', thinking it's escaping the '. As
everything just works without the trailing \, I'd go that way.

> + # The TAP tests build an initdb template under build/tmp_install and
> + # then `robocopy` it into per-test data directories. Robocopy with the
> + # default /COPY:DAT flag doesn't copy ACLs — destinations inherit from
> + # their parent dir. On GitHub-hosted Windows runners the workspace's
> + # inherited ACL grants Administrators:(F) and Users:(RX) but does NOT
> + # grant the runner user (runneradmin) directly. That matters because
> + # pg_ctl on Windows uses CreateRestrictedProcess to drop admin
> + # privileges from postmaster, so the postmaster process has the user
> + # SID in its token but no longer the Administrators group — leaving it
> + # with only "Users:(RX)" on pg_control and friends, which causes
> + # "PANIC: could not open file global/pg_control: Permission denied".
> + #
> + # Fix it once on the workspace dir with (OI)(CI) inheritance flags so
> + # every file/dir created underneath gets an explicit grant for the
> + # current user.
> + - name: Grant workspace ACL to runner user
> + shell: pwsh
> + run: |
> + icacls "${{ github.workspace }}" /grant "${env:USERNAME}:(OI)(CI)F" /Q | Out-Null
> + Write-Host "Granted Full Control to $env:USERNAME on ${{ github.workspace }}"

Perhaps this would be better to fix by changing the robocopy flags?

> + # postgres' plpython3u loads python3.dll (the stable-ABI forwarder)
> + # which in turn loads whichever python3NN.dll the Windows loader finds
> + # first on PATH. On windows-2022 `C:\Program Files\Mercurial\` ships
> + # its own python3.dll + python39.dll and appears on PATH *before* the
> + # hostedtoolcache Python 3.12 — so without intervention the backend
> + # ends up running Python 3.9 while postgres' stdlib search uses 3.12,
> + # producing `ImportError: cannot import name 'text_encoding' from
> + # 'io'` (the 3.12 `io.py` calling into 3.9's `_io`).
> + #
> + # Pin PYTHONHOME to the Python 3.12 prefix, and prepend that prefix
> + # to PATH so its python3.dll wins the DLL search.
> + - name: Pin Python prefix on PATH and PYTHONHOME
> + shell: pwsh
> + run: |
> + $prefix = (python -c "import sys; print(sys.prefix)").Trim()
> + Add-Content $env:GITHUB_ENV "PYTHONHOME=$prefix"
> + Add-Content $env:GITHUB_PATH $prefix
> + Write-Host "PYTHONHOME=$prefix"
> + Write-Host "Prepended $prefix to PATH"

GRJGJKLJKJDFJKDF.

> + - name: Install dependencies
> + shell: pwsh
> + run: |
> + choco install -y --no-progress --limitoutput diffutils winflexbison
> + # meson + ninja aren't preinstalled on windows-2022. Install via pip
> + python -m pip install --upgrade meson ninja
> +
> + # OpenSSL 1.1 via the slproweb installer (pinned to match the
> + # version used elsewhere in postgres CI).
> + curl.exe -fsSL -o openssl-setup.exe https://slproweb.com/download/Win64OpenSSL-1_1_1w.exe
> + Start-Process -Wait -FilePath ./openssl-setup.exe `
> + -ArgumentList '/DIR=c:\openssl\1.1\ /VERYSILENT /SP- /SUPPRESSMSGBOXES'
> + # The slproweb installer puts libcrypto-1_1-x64.dll / libssl-1_1-x64.dll
> + # in c:\openssl\1.1\bin\ and updates the system PATH. GH Actions
> + # snapshots PATH at job start though, so the running job won't
> + # see those DLLs and initdb.exe would crash silently at runtime.
> + # Push the bin dir onto GITHUB_PATH so it persists for later steps.
> + Add-Content $env:GITHUB_PATH "c:\openssl\1.1\bin"

I don't like that much, but I'm not sure we have a better alternative
short-term.
> + windows-mingw:
> + name: Windows - Server 2022, MinGW64 - Meson
> + needs: [setup, sanity-check]
> + if: |
> + !cancelled() &&
> + needs.setup.outputs.mingw == 'true' &&
> + needs.sanity-check.result != 'failure'
> + runs-on: windows-2022
> + timeout-minutes: 60
> + env:
> + TEST_JOBS: 4 # higher concurrency causes occasional failures
> + PG_TEST_USE_UNIX_SOCKETS: 1
> + PG_REGRESS_SOCK_DIR: 'c:\pgsock\'
> + TAR: "c:/windows/system32/tar.exe"
> + # for mingw plpython to find its installation
> + PYTHONHOME: D:/a/_temp/msys64/ucrt64
> +
> + MSYS: winjitdebug
> + CHERE_INVOKING: 1
> + MESON_FEATURES: >-
> + -Dnls=disabled

Missing comments from .cirrus.tasks.yml

Thanks for working on this!

Greetings,

Andres Freund
