Re: [PING] fallocate() causes btrfs to never compress postgresql files

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dimitrios Apostolou <jimis(at)gmx(dot)net>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Subject: Re: [PING] fallocate() causes btrfs to never compress postgresql files
Date: 2025-05-31 14:33:27
Message-ID: 4453f831-bcbe-49e2-88ed-747f0abbdebb@vondra.me
Lists: pgsql-hackers

On 5/31/25 16:00, Thomas Munro wrote:
> On Fri, May 30, 2025 at 3:58 AM Dimitrios Apostolou <jimis(at)gmx(dot)net> wrote:
>> All I'm saying is that this is a regression for PostgreSQL users that keep
>> tablespaces on compressed Btrfs. What could be done from postgres, is to
>> provide a runtime setting for avoiding fallocate(), going instead through
>> the old code path. Ideally this would be an option per tablespace, but even
>> a global one is better than nothing.
>
> Here's an initial sketch of such a setting. Better name, design,
> words welcome. Would need a bit more work to cover temp tables too.
> It's slightly tricky to get smgr to behave differently because of the
> contents of a system catalogue! I couldn't think of a better way than
> exposing it as a flag that the buffer manager layer has to know about
> and compute earlier, but that also seems a bit strange, as fallocate
> is a highly md.c specific concern. Hmm.
>

I find the definition of io_min_fallocate confusing, or rather that 0
means "never" instead of "always". It's described as a "threshold at
which to start using fallocate", so I'd expect 0 to mean "always"
because (len >= 0).

I suggest using -1 to mean "never" and 0 to mean "always", as for other
similar settings (e.g. log_min_duration_statement or
log_autovacuum_min_duration).
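
To illustrate the semantics I have in mind, here is a minimal sketch. The
helper name use_fallocate() and the idea of comparing against a block
count are my assumptions for illustration, not something taken from the
patch:

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not from the patch): decide whether an extension
 * by nblocks should go through fallocate().
 *
 *   -1  => never use fallocate()
 *    0  => always use fallocate(), because nblocks >= 0 always holds
 *    N  => use fallocate() only when extending by at least N blocks
 */
static bool
use_fallocate(int io_min_fallocate, int nblocks)
{
    if (io_min_fallocate < 0)
        return false;           /* -1 disables the feature */

    return nblocks >= io_min_fallocate;
}
```

With this reading the setting stays a plain threshold, and "disabled" is
the one special value, same as for the logging GUCs.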

> I suppose something like the 0001 part could be back-patched if this
> is considered a serious enough problem without other workarounds, so I
> did this in two steps. I wonder if there are good reasons to want to
> change the number on other file systems. I suppose it at least allows
> experimentation.

Maybe. It'd need to get some of the 0002 bits too, ofc.

I'm not sure we really want all these special GUCs tailored for different
filesystems. We already have a few such GUCs, it's getting tricky to
know which ones to set / not set, and it also changes with the
filesystem version ... I personally don't know which ones to set, a lot
of the knowledge is somewhat outdated I think.

Wouldn't it be better for btrfs to just start returning EOPNOTSUPP
(maybe with a mount option), in which case we already do the right thing
automatically? Sure, it means the admin needs to be aware of this in
both cases.

regards

--
Tomas Vondra
