Re: Adding CI to our tree

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Adding CI to our tree
Date: 2021-12-17 19:31:59
Message-ID: 20211217193159.pwrelhiyx7kevgsn@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2021-12-17 12:34:36 +0100, Peter Eisentraut wrote:
> On 13.12.21 22:12, Andres Freund wrote:
> > Attached is an updated version of the CI patches. An example of a test run
> > with the attached version of this
> > https://cirrus-ci.com/build/6501998521483264
>
> + only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' ||
> $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*freebsd.*'
>
> I'm not in favor of this kind of thing. I don't understand how this is
> useful, other than perhaps for developing *this* patch. I don't think
> people would like having these tags in the mainline, and if it's for local
> use, then people can adjust the file locally.

Definitely not for mainline. But it's extremely useful for development. If you
iteratively try to fix windows, running all the other tests can be slower -
there's a concurrency limit in how many tests you can run for free...

> + CC="ccache cc" CFLAGS="-O0 -ggdb"'
>
> I don't think using -O0 is the right thing. It will miss some compiler
> warnings, and it will not thoroughly test the compiler. We should test
> using the configurations that are close to what users actually use.

Hm. I personally always end up using -O0 for the actual development tree, and
it seems a lot of others do as well. Building with -O2 makes backtraces etc
just less useful.

> + - su postgres -c 'gmake -s -j3 && gmake -s -j3 -C contrib'
>
> Why doesn't this use make world (or world-bin, if you prefer).

I started working on this well before world-bin existed. And using 'world' as
the target builds the docs, which is quite expensive... I happened to actually
make the change to world-bin yesterday for the next version to send :)

> Why does this use -j3 if there are two CPUs configured? (Perhaps the number
> of CPUs should be put into a variable.)

I tried a few and that worked best.

> I don't like that the -s option is used. I would like to see what commands
> are executed.

I can change it - but it makes it *much* harder to spot compiler warnings.

> + cpu: 4
>
> Why does the Linux job use 4 CPUs and the FreeBSD job 2?

I'll add a comment about it. Two reasons
1) the limits on cirrus are lower for linux than freebsd:
https://cirrus-ci.org/faq/
2) There's some issues on freebsd where test performance regressess *very*
substantially with higher concurrency. Thomas and I looked a bunch into it
without figuring out the details.

> + - export

> I don't think that is portable to all shells.

Doesn't really need to be?

> + - su postgres -c 'time script test.log gmake -s -j2 ${CHECK}
> ${CHECKFLAGS}'
>
> + su postgres -c '\
> + ulimit -c unlimited; \
> + make -s ${CHECK} ${CHECKFLAGS} -j8 \
> + '
>
> Not clear why these are so different. Don't we need the test.log file for
> Linux?

There's a comment about the use of script:
# Use of script is to avoid make complaints about fcntl()
# https://savannah.gnu.org/bugs/?60774

that bug is specific to platforms that don't allow locking pipes. Which linux
does allow, but freebsd doesn't.

> Don't we need the ulimit call for FreeBSD?

I think the default core limits were different, I will check.

> Why the -j8 option even though 4 CPUs have been configured?

That might have been an accident.

> + brew install \
> + ccache \
> + coreutils \
> + icu4c \
> + krb5 \
> + llvm \
> + lz4 \
> + make \
> + openldap \
> + openssl \
> + python(at)3(dot)10 \
> + tcl-tk
>
> Curious why coreutils and make are installed? The system-supplied tools
> ought to work.

make because newer versions of make have -Otarget, which makes concurrent
check-world output at least kind-of readable.

> + brew cleanup -s
>
> Seems unnecessary?

It reduces the size of the cached downloads. Not much point in keeping older
versions of the package around. Populating the cache takes time.

> + PKG_CONFIG_PATH="/usr/local/opt/krb5/lib/pkgconfig:$PKG_CONFIG_PATH"
>
> AFAICT, we don't use pkg-config for the krb5 package.

I now converted this to a loop.

> + - gmake -s -j12 && gmake -s -j12 -C contrib
>
> These numbers should also be explained or configured somewhere. Possibly
> query the number of CPUs on the instance.

macOS instances have a fixed number of cores - 12. Might make sense to query
it, but not sure what a good portable way there is.

> + PROVE_FLAGS: -j10
>
> Why only on Windows?

Because windows doesn't have a way to run tests in parallel in another
way. prove-level concurrency is the only thing. Whereas other platforms can
run tests in parallel via make. Combining both tends to not work very well in
my experience.

> + # Installation on windows currently only completely works from
> src\tools\msvc
>
> If that is so, let's fix that.

I did report the problem - just haven't gotten around to fixing it. Note this
is also how the buildfarm invokes installation... The problem is that
Install.pm includes config.pl from the current directory, IIRC.

At some point I needed to restrict to dealing with the current state - there's
plenty other bugs.

> + - cd src\tools\msvc && perl .\install.pl
> %CIRRUS_WORKING_DIR%\tmp_install
>
> Confusing mix of forward and backward slashes in the Windows section. I
> think forward slashes should work everywhere.

They would work here I think, but no, they don't work everywhere :(

> + test_plcheck_script:
> + - perl src/tools/msvc/vcregress.pl plcheck
>
> etc. Couldn't we enhance vcregress.pl to take multiple arguments or take a
> "check-world" argument or something. Otherwise, this will be tedious to
> keep updated.

> + test_subscriptioncheck_script:
> + - perl src/tools/msvc/vcregress.pl taptest .\src\test\subscription\
>
> This is even worse. I don't want to have to hand-register every new TAP
> test.

I strongly agree. There were several tests that the buildfarm on windows
didn't ever run before I started working on this. And clearly no windows
developer is going to manually invoke ~10 test steps.

> + always:
> + gcc_warning_script: |
> + time ./configure \
> + --cache gcc.cache CC="ccache gcc" \
> + --enable-dtrace
>
> I don't know why we wouldn't need the full set of options here. It's not
> like optional code never has compiler warnings.

I mostly didn't like the repetition of long argument lists. There's probably a
decent way to deal with that.

> + # cross-compile to windows
> + always:
> + mingw_cross_warning_script: |
>
> I would welcome a native mingw build with full options and test suite run.
> This cross-compiling stuff is of course interesting, but I'm not sure why it
> is being preferred over a full native run.

I have a new colleague working on scripting the setup of mingw on
windows. Besides not being available yet, it's *much* *much* slower to build
postgres. This is a useful and quicker screening.

> --- /dev/null
> +++ b/.dockerignore
> @@ -0,0 +1,7 @@
> +# Ignore everything, except ci/
>
> I wonder whether this would interfere with other uses of docker. I suppose
> people might have their custom setups for building docker images from
> PostgreSQL sources. It seems weird that this file gets this prominence,
> saying that the canonical use of docker inside PostgreSQL sources is for
> Cirrus CI.

I have a hard time seeing uses of docker where this would be a problem. If you
actually use the whole tree as context you pretty much need to exclude most of
it, otherwise docker (and other tools) tar the whole tree and send it to the
daemon.

> It would be useful if the README explained the use of docker. As I
> mentioned before, it's not immediately clear why docker is used at all in
> this.

It boils down to this: Windows tests on cirrus are always run via docker
(presumably because the licensing otherwise is more expensive). And for linux,
it's considerably quicker to start up a container, rather than a full VM - but
a full VM is slower.

> The docker file for Windows contains a lot of hardcoded version numbers.
> This should at least be refactored a little bit so that it is clear which
> version numbers should be updated and how. Or better yet, avoid the need to
> constantly update version numbers.

I don't really see a great way to avoid them in general. And e.g. with perl
the issue is that plperl straight up doesn't work with a newer perl version :(.

> Also, the way OpenSSL is installed looks a bit fishy. Is this what people
> actually use in practice? How can we make it match actual practice better?

I wish I knew. I didn't see any good practice for this anywhere.

> +# So that tests using the "manually" started postgres on windows can use
> +# prepared statements
> +max_prepared_transactions = 10
>
> Maybe add that to the pg_ctl invocation in the Windows section instead.

It doesn't hurt anything else, so I don't really think it's worth going a
platform dependent way?

> +# Settings that make logs more useful
> +log_autovacuum_min_duration = 0
> +log_checkpoints = true
> +log_connections = true
> +log_disconnections = true
> +log_line_prefix = '%m [%p][%b] %q[%a][%v:%x] '
> +log_lock_waits = true
>
> If we think these are useful, we should make the test suite drivers set them
> for all users.

Some of these are set for some but not some other test drivers. And it's quite
useful for out-of-tree development to be able to quickly add options to all
tests... And no test driver helps the windows case that needs the manually
started postgres (the state of windows testing really is dreadful).

> For the above reasons of lack of documentation, I still don't understand the
> whole docker flow here. Do you mean, the docker files included in your
> patch are not actually used as part of the CI run; instead you use them to
> build images manually, which are then pulled in by the test runs?

As submitted this is about VM images, not docker images. Used by linux (for
the main test run) and freebsd (cirrus doesn't support anything else). Some
other platforms could also be supported that way with a bit more work
(openbsd, netbsd). For some other platforms cirrus only supports docker
containers (windows, linux on arm).

The docker containers can be built on-demand (that's what the dockerfile:
... syntax for e.g. windows does). So yes, they currently are used. cirrus
rebuilds them whenever their content (including referenced files) changes.

But the building of containers happens in every repo enabling the tests. That
takes quite a while and uses a lot of space (the additional windows
installation is like ~6GB). So I was working over the last few days on moving
the containers to be built the alongside the VM images. Then it looks more
similar across all the platforms.

> I think it would be much easier to get this project over the initial hump if
> we skipped the whole docker business and just set the images up from scratch
> on each run.

It's not feasible. The windows stuff takes *way* too long (as in 40min),
freebsd takes quite long, linux long. And the amount of traffic it generates
to install packages over and over again isn't great either. The containers /
images are from within google's network, and thus free - the package repos
don't have that advantage.

FWIW, homebrew at some point complained about a huge fraction of their cost
being from CI. I know debian had issues with that on and off as well. While I
couldn't solve the homebrew issue via VM images, I made it at least cache the
homebrew downloads between runs.

My current (not yet submitted version) comment about this is:

> Images used for CI
> ==================
>
> To keep CI times tolerable, most platforms use pre-generated images. Some
> platforms use containers, others use full VMs. Images for both are generated
> separately from CI runs, otherwise each git repository that is being tested
> would need to build its own set of containers, which would be wasteful (both
> in space and time.
>
> These images are built from the specifications in github.com/anarazel/pg-vm-images/

I also did the work of redoing the necessary setup and documenting all the
steps ([1] although there could be a bit more handholding) . It'd now not be
hard to transfer the image building into different / shared ownership.

I'll expand this section.

> > The second attention-worthy point is the choice of a new toplevel ci/
> > directory as the location for resources referencenced by CI. A few other
> > projects also use ci/, but I can also see arguments for moving the contents to
> > e.g. src/tools/ci or such?
>
> Or src/tools/cirrus/? This providers came and go, and before long there
> might be interest in another one.

I think quite a bit of the work will be portable between CI providers. I think
having all the different CI things in one directory will make it more obvious
than having a bunch of provider names one might or might not know.

I'm imaginging that over time we'd put some of the stuff in the .cirrus.yml
file into scripts in src/tools/ci/, so they can be used by different CI
providers.

> > - Extend the set of compiler warnings - as the compiler version is controlled,
> > we could be more aggressive than we can be via configure.ac.
>
> Not sure about that. I don't want this to evolve into some separate pool of
> policies that yells at you because of some settings that you never heard of.
> If we think other warnings are useful, we should provide a way to select
> them, perhaps optionally, from the usual build system.

I'd be happy with that as well. I do think that the compiler version
dependency makes it bit harder to this usefully from configure though.

> > - consider enable compile-time debugging options like COPY_PARSE_PLAN_TREES,
> > and run-time force_parallel_mode = regress on some platforms. They seem to
> > catch a lot of problems during development and are likely affordable enough.
>
> That would be useful if we can think of a way to select it optionally.

What kind of optionally are you thinking? Commit message contents? A central
flag somewhere?

I was thinking it might be worth enabling such options on one of the
platforms, so that one gets coverage by default. People remembering
e.g. COPY_PARSE_PLAN_TREES don't tend to need it...

Greetings,

Andres Freund

[1] https://github.com/anarazel/pg-vm-images/blob/main/gcp_project_setup.txt

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2021-12-17 19:34:36 Re: Adding CI to our tree
Previous Message Justin Pryzby 2021-12-17 19:10:28 Re: In-placre persistance change of a relation