Re: ci: CCache churns through available space too quickly

From: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: ci: CCache churns through available space too quickly
Date: 2026-06-08 10:30:03
Message-ID: CAN55FZ3BP3m81azZmS+qvb9ndPD-GDHvh0fAo7mL0UONsgmqFQ@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thank you for working on this!

On Fri, 5 Jun 2026 at 23:09, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> I noticed that a handfull of CI runs already lead to exceeding the available
> cache space. One can pay for more cache space, but I think the problem is
> more that what we currently do doesn't work well.
>
> With cirrus-ci all branches shared one cache, but that's not the case with
> github actions. Except for being able to read caches from the default branch
> (master in our case), other branches have completely separate cache
> namespaces. That's probably the right call, safety wise, but makes our ccache
> approach .. not great.
>
> We should only upload a new cache when the ccache cache hit ratio of the
> existing cache entry has gotten low.

This makes sense.

> We also chose the cache key unfortunately, so that if a branch name started
> with the name of the default branch, followed by a -, we'd always end up using
> the main branches cache.
>
>
> The attached patch fixes these, and a few other problems. See commit message
> for details. With it I see a lot less cache churn and therefore also a higher
> hit rate once one has more than 2-3 branches.
>
>
> I'm not entirely happy with the amount of per job repetition this has. While
> staying within the confines of a single .yml file, I couldn't find a better
> way to deal with that. We could move a fair bit of that complexity into a
> separate file, using so called "composite actions". But that's a bit of
> additional github actions specific stuff that one would be exposed to, so I'm
> not sure we should go that way?

I think it looks okay, no need to use composite actions for this.

--------------------

I tested the patch and I confirm that it works as mentioned. Here is my review:

All the points you explained in the commit message are nice improvements!

Typo in commit message:

+ In my testing this utilizes the available cache space (10GB for personal
+ accounts) much more effictively than before.

Typo at 'effictively' in the commit message.

diff --git a/.github/workflows/pg-ci.yml b/.github/workflows/pg-ci.yml
index 8560e9389f6..86dc47de8db 100644
--- a/.github/workflows/pg-ci.yml
+++ b/.github/workflows/pg-ci.yml

+ - &ccache_decide_save_step
+ name: "ccache: Decide if cache should be uploaded"
+ id: ccache-pre-save
+ # [Decide to] store the cache whenever the cache was set up, so that
+ # incrementally addressing compiler errors/warnings doesn't have to
+ # start from scratch.
+ if: |
+ always() &&
+ steps.ccache-restore-branch.conclusion == 'success'
+ run: python3 src/tools/ci/gha_ccache_decide.py

Isn't the conclusion always true unless GitHub has some self errors?

Also, we are directly running this script with the 'python3' command
but it might not be available on the PATH. I had some problems with
this on BSD images when we were using Cirrus. I am not sure we would
have such problems with GitHub Actions but I just wanted to mention
it.

diff --git a/src/tools/ci/gha_ccache_decide.py
b/src/tools/ci/gha_ccache_decide.py
new file mode 100644
index 00000000000..920f7bf9685
--- /dev/null
+++ b/src/tools/ci/gha_ccache_decide.py

+def main():
+ on_default_branch = os.environ["ON_DEFAULT_BRANCH"] == "true"
+ ccache_dir = os.environ["CCACHE_DIR"]

ccache_dir isn't used.

+ # compute cache hit ratio
+ hits, misses = parse_ccache_stats()
+ total = hits + misses
+ hit_pct = int(( hits / total) * 100) if total > 0 else 100

Extra space in '( hits'.

+ # If there were either barely any misses, or the cache hit ratio was high,
+ # there no point in generating a new cache entry. We have limited cache
+ # space.
+ should_save = misses > 10 and hit_pct < target_rate

We consider misses here but we don't mention it, we only mention hit
rate and target rate. I think this is not very important since we
can't possibly have a case that misses < 10 and hit_pct < target_rate.
If that is not the case, then I think we can remove misses from the
should_save calculation.

+ # Don't store ccache stats , otherwise we'd need to reset the cache access

Extra space before comma.

--
Regards,
Nazir Bilal Yavuz
Microsoft

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey Borodin 2026-06-08 10:48:27 Re: pg_rewind does not rewind diverging timelines
Previous Message jian he 2026-06-08 10:28:55 Re: Fix bug of CHECK constraint enforceability recursion