Re: Heads Up: cirrus-ci is shutting down June 1st

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>
Subject: Re: Heads Up: cirrus-ci is shutting down June 1st
Date: 2026-06-01 10:01:58
Message-ID: CAKZiRmzXiF3Gwq6BSCA6jO1gQ+kxqf6v3Tim6V2ZFwthnf6gTw@mail.gmail.com
Lists: pgsql-hackers

Hi Andres,

On Fri, May 29, 2026 at 5:56 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2026-05-29 13:38:17 +0200, Jakub Wartak wrote:
> > On Fri, May 29, 2026 at 11:51 AM Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> wrote:
> > [..]
> > Hi, thanks to everybody for working on this.
> >
> > > https://github.com/nbyavuz/postgres/actions/runs/26628396798
> >
> > Windows (runs-on: windows-2022) seems kind of slow isn't it ?
> >
> > Maybe that's not related to the patch itself, but any idea why the windows
> > tests are so slow? Or will we able to somehow accelerate those?
> >
> > Windows - VS - Meson & ninja / succeeded [..] minutes ago in 31m 28s
> >
> > Processor(s): 1 Processor(s) Installed.
> > [..]
> > Total Physical Memory: 16,379 MB
> > [..]
> >
> > but:
> > NUMBER_OF_PROCESSORS=4
> > [..]
> > + TEST_JOBS: 8
> >
> > vs
> >
> > 392/396 test_json_parser - postgresql:test_json_parser/002_inline
> > OK 152.56s 3712 subtests passed
> > 393/396 pgbench - postgresql:pgbench/001_pgbench_with_server
> > OK 574.61s 474 subtests passed
> > 394/396 pg_rewind - postgresql:pg_rewind/002_databases
> > OK 772.86s 10 subtests passed
> > 395/396 pg_waldump - postgresql:pg_waldump/001_basic
> > OK 771.19s 156 subtests passed
> > 396/396 libpq_pipeline - postgresql:libpq_pipeline/001_libpq_pipeline
> > OK 395.76s 23 subtests passed
> >
> > while last CirrusCI run for me for Windows took 19min 21s (4 CPUs / 4 GBs,
> > but sysinfo reported there "Total Physical Memory: 16,380 MB").
>
> The difference here likely is due to the different type of CPU cores. On
> cirrus, we got 4 non-SMT cores (because the type of CPU used didn't use SMT),
> whereas on GHA we have 4 hardware threads, but only two real cores.
>
>
> > If that's IO traffic as Andres described, maybe we could enable feature
> > called "Turn off Windows write-cache buffer flushing on the device"
> > in device manager -> disk -> policies, but dunno how much that would
> > help really as we seem to be already using fsync=off, maybe it helps
> > when saving other files too (???)
>
> I think I was wrong about IO being the main issue. I've measured the CPU
> utilization during a linux run, and basically it's 100% busy during the whole
> test run (baring the first and last few seconds). Which does seem to mainly
> point to the difference being simply that we just have half the real cores as
> we had before.
>
> I do see higher %sys CPU utilization than I'd expect, so that may be worth
> investigating.

So I've spent half a day trying to see what makes the tests so slow, at
least in my case. I can also confirm the high combined %CPU (with a high 33% sys).

0. Baseline was ~71s (stuff already hot).
1a. Down to 64s with dirty writeback tuning (mostly to avoid NVMe/SSD wear).
1b. ~65s with tmpfs (mounted as shown below), so I've stayed with the dirty
writeback sysctls:
    sudo mount -t tmpfs -o size=4G,uid=XXX,mode=755 tmpfs build/tmp_install
    sudo mount -t tmpfs -o size=16G,uid=XXX,mode=755 tmpfs /build/testrun
2. Splitting the longest tests (isolation, 027_stream_regress, pg_upgrade)
into 4 parallel streams each did not help much.
3. I've spotted the falcon-sensor (EDR agent, using eBPF) being very busy, so
I've shut it down and got the duration down to 43s.
4. Still, for that 43s the dominant factor was the mmap/page-fault/PTE work
related to the number of backends we spawn. Later, when I put Claude to
work on it, it told me: "Backend startup costs roughly 2.5x as much as
the actual queries". And later, when I pushed it to count using
log_connections, it said "Got 24,903 total connections in 46 s = 541
backend forks/second." and produced this top report:
    8,610  subscription  - 35 % of all connections in the suite
    4,382  recovery      - 18 %
    1,100  pg_upgrade
      896  isolation
      694  pg_dump
      682  pg_basebackup

Cutting the subscription suite above down to ~5000 connections did not gain
much (well, it saved ~5% of runtime, 43s -> 41s). It's literally 10k lines of
s/$node_subscriber->safe_psql/sub_bg->query_safe/g across dozens of files
in src/test/subscription/t/. It's too big for review and I'm not sharing it,
as it could contain errors.

5. I spotted that we do plenty of initdb and cached-initdb copies (cp), so I
had the idea of using cp --reflink=always on XFS in build/. I couldn't do
that without /dev/loop, but apparently XFS (reflink=1) vs ext4 (reflink=0)
halves the number of writes even on a /dev/loop device. That somehow does
not directly reduce the duration of the tests (well, we are bottlenecked
on CPU anyway, so this is just a smarter(?) way of avoiding I/O; maybe with
cold caches and on real VMs, running with XFS would be faster).

+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -687,7 +687,13 @@ sub init
}
else
{
- @copycmd = qw(cp -RPp);
+ @copycmd = qw(cp --reflink=always -RPp);

Other interesting ideas: pg_regress with a built-in connection pool (IMHO not
worth it), mitigations=off (to avoid syscalls being taxed; got no
improvement with this).
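
For reference, the arithmetic behind the connection numbers in item 4 can be re-checked with a few lines of Python. This is just a sketch over the figures quoted above (nothing is newly measured); the function names are made up for the example, and "~5000" is taken as exactly 5000.

```python
# Figures quoted in item 4 of this mail; not new measurements.
TOTAL_CONNECTIONS = 24_903
ELAPSED_SECONDS = 46
PER_SUITE = {
    "subscription": 8_610,
    "recovery": 4_382,
    "pg_upgrade": 1_100,
    "isolation": 896,
    "pg_dump": 694,
    "pg_basebackup": 682,
}


def forks_per_second(total: int, seconds: float) -> float:
    """Average backend startups per second over the whole run."""
    return total / seconds


def share_percent(count: int, total: int) -> float:
    """Share of `count` in `total`, in percent."""
    return 100.0 * count / total


def report() -> list[str]:
    """Build one line per suite plus a line for everything not listed."""
    lines = [
        f"{n:>6} {suite:<14} {share_percent(n, TOTAL_CONNECTIONS):5.1f} %"
        for suite, n in PER_SUITE.items()
    ]
    other = TOTAL_CONNECTIONS - sum(PER_SUITE.values())
    lines.append(f"{other:>6} {'(other suites)':<14} "
                 f"{share_percent(other, TOTAL_CONNECTIONS):5.1f} %")
    return lines


if __name__ == "__main__":
    print(f"{forks_per_second(TOTAL_CONNECTIONS, ELAPSED_SECONDS):.0f} forks/s")
    print("\n".join(report()))
    # Subscription rewrite: 8,610 -> ~5,000 connections, runtime 43s -> 41s.
    removed = share_percent(8_610 - 5_000, TOTAL_CONNECTIONS)
    saved = share_percent(43 - 41, 43)
    print(f"connections removed: {removed:.1f} %, runtime saved: {saved:.1f} %")
```

The last two numbers show that the share of connections removed is noticeably larger than the share of runtime saved, which fits with the subscription rewrite not gaining much.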

As for Windows, I don't have a better idea than to just avoid I/O if possible
("Turn off Windows write-cache buffer flushing on the device"), sorry(!), and
maybe throwing a bigger box at it... ;]

-J.
