
Re: Heads Up: cirrus-ci is shutting down June 1st

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc:	Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>
Subject:	Re: Heads Up: cirrus-ci is shutting down June 1st
Date:	2026-06-02 18:38:37
Message-ID:	lal6n3ym3ukug26bhoklqp6xuuxi7psqdmi7knjawdwzyubmv4@shzlq5c75z5q
Lists:	pgsql-hackers

Hi,

On 2026-06-01 12:01:58 +0200, Jakub Wartak wrote:
> So I've spent half of day on trying to see what makes the tests so slow at
> least in my case. I can also confirm %CPU combined (with high 33% sys).

Was this locally on your machine? I assume that's without enabling
sanitizers?

In CI the bottleneck clearly is CPU at the moment, due to the relatively low
number of cores.

To reduce IO, one pretty significant thing we can do is to reduce the WAL
segment size used during tests. Creating lots of 16MB segments when most of
them are only very partially used isn't free.

> 0. baseline was ~71s (stuff already hot)
> 1a. down to 64s with dirtywriteback tune (and mostly to avoid NVMe/SSD wear)
> 1b. ~65s with tmpfs, so I've left using dirtywriteback sysctls:
> sudo mount -t tmpfs -o size=4G,uid=XXX,mode=755 tmpfs build/tmp_install
> sudo mount -t tmpfs -o size=16G,uid=XXX,mode=755 tmpfs /build/testrun

I don't think we should do that, real FS behaviour is something we do IMO want
to test.

> 2. Splitting the tests (isolation, 027_stream_regress, pg_upgrade) into 4
> parallel streams of each did not help much (they are longest ones)

> 3. I've spotted the falcon-sensor (EDR agent, using eBPF) very busy, so
> I've shut it down, got the duration down to 43s.

Heh.

> 4. Still for that 43s dominant factor was the mmap/page-fault/PTEs related
> to the number of backends we spawn. Literally later when I put
> Claude to work he said to me this "Backend startup costs roughly 2.5x
> as much as the actual queries". And later when I've pushed to count using
> log_connections it said "Got 24,903 total connections in 46 s = 541
> backend forks/second." and got this top report:
> 8,610 subscription - 35 % of all connections in the suite
> 4,382 recovery - 18 %

Hah. I wonder how much of this is just polling for catchup and such. Which we
should totally make smarter (e.g. using WAIT FOR in more places and making
poll_query_until() have adaptive sleep times).
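
To illustrate what "adaptive sleep times" could look like, here is a minimal
sketch in Python. It is not the actual poll_query_until() code (that lives in
the Perl test framework); the function name poll_until, its parameters and the
default values are all made up for illustration. The idea is just to start
with a very short sleep, so fast conditions are noticed quickly, and back off
exponentially up to a cap, so slow conditions do not cause a flood of polls.

```python
import time
from typing import Callable


def poll_until(check: Callable[[], bool],
               timeout: float = 180.0,
               initial_delay: float = 0.001,
               max_delay: float = 0.1,
               factor: float = 2.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Call check() until it returns True or the timeout expires.

    Hypothetical helper: the sleep between attempts starts at
    initial_delay and is multiplied by factor after every failed
    attempt, capped at max_delay.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        if check():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * factor, max_delay)
```

In a real test the check callback would open a connection and run the query,
so every avoided poll is also an avoided backend start.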

> 1,100 pg_upgrade
> 896 isolation
> 694 pg_dump
> 682 pg_basebackup
>
> Fixing above subscription to ~5000 conns did not gain much (well it saved
> 5% of runtime 43s -> 41s). It's literally 10k lines of
> s/$node_subscriber->safe_psql/sub_bg->query_safe/g across dozens of files
> in src/test/subscription/t/). Too big for review and I'm not sharing as
> it could contain errors.

Did you test the effect of those changes on Windows (via CI)? I'd expect that
big a reduction to have a substantially bigger effect there.
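
For anyone wanting to reproduce the per-suite connection counts quoted above,
something like the following rough sketch might do. It assumes
log_connections is enabled, that each connection produces a log line
containing "connection received:", and that the server logs sit under
testrun/<suite>/<test>/log/*.log; the function name and the directory layout
are assumptions, so adjust the glob to your build tree.

```python
from collections import Counter
from pathlib import Path


def count_connections(testrun_dir: str) -> Counter:
    """Count 'connection received:' log lines per test suite.

    Assumes the layout <testrun_dir>/<suite>/<test>/log/*.log,
    which is an assumption about the build tree, not a given.
    """
    root = Path(testrun_dir)
    counts: Counter = Counter()
    for logfile in root.glob("*/*/log/*.log"):
        # The first path component below the root is the suite name.
        suite = logfile.relative_to(root).parts[0]
        with logfile.open(errors="replace") as f:
            counts[suite] += sum(
                1 for line in f if "connection received:" in line)
    return counts
```

Calling count_connections("build/testrun").most_common() would then list the
suites with the most connections first.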

> 5. Spotted that we do plenty of initdb and cached-initdb (cp), so I had idea
> about XFS's cp reflinks=always in build/, but I couldn't do that without
> /dev/loop, so apparently XFS (reflink=1) vs ext4(reflink=0) halves number
> of writes while even still on /dev/loop device, but that somehow
> does not directly contribute to duration of the test (well we are
> bottlenecked on CPU anyway, so this is just smarter? way of avoiding I/O;
> maybe with cold-caches and on real VMs running with XFS would be faster)
>
> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
> @@ -687,7 +687,13 @@ sub init
> }
> else
> {
> - @copycmd = qw(cp -RPp);
> + @copycmd = qw(cp --reflink=always -RPp);

Afaict cp uses reflinks automatically by default, if the filesystem supports
it. On CI it's not supported due to ext4, but locally it seems to work for
me.
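
For reference, as far as I know GNU coreutils switched cp to default to
--reflink=auto in version 9.0, which would explain why it just works on a
filesystem with reflink support. If one wanted to find out whether a copy
actually was reflinked, a small sketch like the one below could help. It
first tries cp --reflink=always (which fails instead of silently falling back)
and only then does a plain copy. The function name is hypothetical and this is
not what Cluster.pm does.

```python
import os
import shutil
import subprocess


def copy_tree_prefer_reflink(src: str, dst: str) -> str:
    """Copy directory src to a not yet existing dst.

    Returns "reflink" if cp --reflink=always succeeded, otherwise
    "copy" after falling back to a regular recursive copy.
    """
    if os.path.exists(dst):
        raise FileExistsError(dst)
    try:
        result = subprocess.run(
            ["cp", "--reflink=always", "-RPp", src, dst],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        if result.returncode == 0:
            return "reflink"
    except FileNotFoundError:
        # No cp binary available at all, fall through to the plain copy.
        pass
    # A failed reflink attempt can leave a partial tree behind.
    shutil.rmtree(dst, ignore_errors=True)
    shutil.copytree(src, dst, symlinks=True)
    return "copy"
```

On ext4, or with a cp that does not know --reflink, this should return
"copy"; on XFS with reflink=1 or on btrfs I would expect "reflink".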

> Other interesting ideas: pg_regress with built-in connection pool (IMHO not
> worth it), mitigations=off (to avoid syscalls being taxed, got no
> improvement with this).

I really doubt that the number of connections pg_regress establishes matters
in comparison to the amount of work done per connection in pg_regress style
tests.

Greetings,

Andres Freund
