Re: [PING] fallocate() causes btrfs to never compress postgresql files

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Dimitrios Apostolou <jimis(at)gmx(dot)net>, Tomas Vondra <tomas(at)vondra(dot)me>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>
Subject: Re: [PING] fallocate() causes btrfs to never compress postgresql files
Date: 2025-08-17 23:23:04
Message-ID: CA+hUKGL74G_CR50N+gr8N1E9emuFJMHAsEKr6oGmHZZDZjRJHg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 8, 2025 at 1:38 AM Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> On Tue, Aug 5, 2025 at 3:08 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>> We discussed that a bit earlier in the thread. Some problems about
>> layering violations and general weirdness, I recall trying it even.
>> On the flip side, is it right to declare very local
>> filesystem-specific choices in a system catalogue that is replicated
>> and affects replicas?
>> What about a fancier GUC that can reference tablespaces?
>
>
> Wouldn't that be something that applies to *all* the tablespace configs then, that is a proper movement of the goalposts? :) Such as being able to set random_page_cost per tablespace to different values on different machines. I agree that it would be useful though. But it seems like a different patch, if useful, and one that should be generic?

Yeah. And while we're talking pie-in-the-sky future features,
full_page_writes is also describing a property of a particular
server's file system and/or hardware for a given tablespace. Can't do
much about that today, as it can only be decided by the primary node
that must log full pages or not, but its potential replacement
"atomic_double_write" (as I call it) *can* be chosen on a per-server
basis in a replication chain. We could probably have done that
independently, but it gets easier with new infrastructure for
streaming large asynchronous combined writes...

To solve Dimitrios's real production issue, I am planning to proceed
with the simple whole-system GUC(s) already posted, after I've done
some light testing on ZFS (which has similar design constraints though
makes different choices) and thought a bit harder about the
Windows/NTFS situation. I'll post a new version before pushing
anything. My plan is to have this in the next minor release, unless
the upcoming 18 release forces me to delay it until the one after.

Another thing I noticed is that macOS has its own funky way[1] of
preallocating disk space that looks plausibly relevant. Not
investigated and not planning to work on that myself necessarily but
it might be worth thinking for a moment about the GUC future-proofing
implications.

[1] https://github.com/libgit2/libgit2/commit/bd132046b04875f928e52d16363fb73f8e85dded
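
For readers unfamiliar with the macOS mechanism mentioned above, here is a rough sketch of how it differs from posix_fallocate(). It is not code from PostgreSQL or from the posted patch. The helper name reserve_space() is hypothetical, the macOS branch uses fcntl(F_PREALLOCATE) as I understand it from Apple's fcntl(2) documentation and has not been tested, and using the logical file size to compute the request length is a simplifying assumption.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only: make sure the file behind
 * "fd" is at least "size" bytes long with disk space reserved for it.
 * Returns 0 on success, or an errno value on failure.
 */
static int
reserve_space(int fd, off_t size)
{
#ifdef __APPLE__
	fstore_t	store;
	struct stat st;

	if (fstat(fd, &st) != 0)
		return errno;
	if (st.st_size >= size)
		return 0;				/* already large enough */

	store.fst_flags = F_ALLOCATEALL;	/* reserve everything or fail */
	store.fst_posmode = F_PEOFPOSMODE;	/* relative to physical EOF */
	store.fst_offset = 0;
	store.fst_length = size - st.st_size;	/* assumption: logical size */
	store.fst_bytesalloc = 0;
	if (fcntl(fd, F_PREALLOCATE, &store) == -1)
		return errno;

	/* F_PREALLOCATE reserves blocks but leaves the file size unchanged. */
	if (ftruncate(fd, size) != 0)
		return errno;
	return 0;
#else
	/* posix_fallocate() returns an error number directly, not via errno. */
	return posix_fallocate(fd, 0, size);
#endif
}
```

The point for GUC design is that the macOS path is a two-step "reserve, then set size" operation rather than a single call, so a setting that names one specific system call may not map cleanly onto every platform.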