| From: | Marcel Menzel <marcel(at)menzel(dot)de> |
|---|---|
| To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
| Cc: | pgsql-general(at)lists(dot)postgresql(dot)org |
| Subject: | Re: pg_upgrade reflink support on OpenZFS |
| Date: | 2025-11-15 16:16:56 |
| Message-ID: | 5fd60425-db26-4700-b716-5be3762acd33@menzel.de |
| Lists: | pgsql-general |
On 15/11/2025 05:17, Thomas Munro wrote:
> On Sat, Nov 15, 2025 at 7:16 AM Marcel Menzel <marcel(at)menzel(dot)de> wrote:
>> For the PostgreSQL upgrade to version 18, I took the opportunity to test
>> the reflink support in pg_upgrade (with --clone) on OpenZFS 2.3.4 /
>> Linux 6.15.11 and it worked flawlessly, being a huge time saver here.
>
> Nice!
>
>> I've looked into the documentation for pg_upgrade and it's only
>> mentioning btrfs and XFS on Linux and not FreeBSD at all, so I thought
>> it'd be an interesting heads-up to report that Linux gained a 3rd FS and
>> that FreeBSD, I think, gained the general ability to do reflink copies.
>
> It does mention both Linux and FreeBSD under --copy-file-range. I
> didn't try to list all the relevant file systems there though, partly
> because I didn't feel like documenting all the quirks (only works if
> you created your XFS file system with the feature enabled, might need
> to frobnicate ZFS sysctl, which NFS clients and servers can push it
> down, likewise for non-COW file systems and device drivers, etc etc).
> It might be nice to find a decent reference for all that stuff
> somewhere else and point to it, but I don't think we can maintain that
> accurately ourselves.
>
> I was actually surprised to hear that ioctl(dest_fd, FICLONE, src_fd)
> worked for you. I knew that it was really BTRFS's ioctl and XFS
> accepted it too, but I didn't know that ZFS also understood it[1] in
> 2.3. They apparently didn't really expect anyone to call it, and
> since ZFS 2.4 is apparently about to ship without it[2], it seems like
> a bad time to add it to the documentation for --clone.
Oh, I haven't looked at the upcoming versions yet, but yeah, it
doesn't make sense to mention this then.
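
The ioctl(dest_fd, FICLONE, src_fd) call quoted above can be tried from a script to see whether a given file system still accepts it. This is only a rough sketch, not what pg_upgrade does internally: the function name try_ficlone is made up for illustration, the FICLONE constant is the value from <linux/fs.h> on x86 and ARM (an assumption; some architectures encode it differently), and treating every OSError as "cannot clone" is a simplification.

```python
import fcntl
import os

# _IOW(0x94, 9, int) from <linux/fs.h>; assumed x86/ARM encoding.
FICLONE = 0x40049409


def try_ficlone(src_path, dst_path):
    """Clone src into dst with the FICLONE ioctl; return True on success.

    The ioctl either shares all blocks or fails, it never falls back to
    copying. On failure the empty destination file is removed again.
    """
    try:
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            # Same argument order as ioctl(dest_fd, FICLONE, src_fd).
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
        return True
    except OSError:
        # Simplification for this sketch: any error means "no clone here".
        if os.path.exists(dst_path):
            os.unlink(dst_path)
        return False
```

Running it against a scratch file on the dataset in question should return True on btrfs, reflink-enabled XFS and (per this thread) OpenZFS 2.3, and False where the ioctl is not understood.
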
>> OpenZFS has supported this since 2.2 but had it disabled due to data
>> corruption bugs. Since 2.3, the sysctl (zfs_bclone_enabled on Linux,
>> vfs.zfs.bclone_enabled on FreeBSD) has been enabled by default, so
>> only the zpool feature "block_cloning" has to be enabled, which might
>> already be the case after running "zpool upgrade".
>
> Yeah, those data corruption reports (which turned out to be
> misattributed IIRC?) provided one reason to keep the old BTRFS ioctl()
> under --clone but add the new behaviour under --copy-file-range.
> --copy-file-range should work for all COW filesystems on Linux via
> proper VFS entrypoints, and is the official way to do this from user
> space. Perhaps we should eventually harmonise this under a single
> option and drop the ioctl() stuff. One semantic change would be that
> copy_file_range() means "copy with your best trick" (could be cloning,
> network/driver pushdown or user space buffer copy, silently selecting
> the behaviour), while the BTRFS ioctl() means "clone or fail" IIRC, so
> that was another reason to want a separate option for now.
I haven't looked closely at the copy_file_range() syscall and how tools
interact with it in detail yet, but I've found this interesting GitHub
comment[3], which gives me a clearer picture now. It's totally
understandable that the OpenZFS developers removed the compat for those
BTRFS ioctls, since they now have a proper replacement.
The OpenZFS docs[4][5] also mention that the copy_file_range() syscall
goes through the BRT (Block Reference Table), so I guess I'll use
pg_upgrade with --copy-file-range the next time.
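
To make the "copy with your best trick" semantics a bit more concrete, here is a small sketch using Python's os.copy_file_range() wrapper (Linux only). The function name copy_best_effort is made up, and the user-space fallback on error is an assumption of this sketch, not something taken from pg_upgrade.

```python
import os
import shutil


def copy_best_effort(src_path, dst_path):
    """Copy src to dst via copy_file_range(); return the size of dst in bytes."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        remaining = os.fstat(src.fileno()).st_size
        try:
            while remaining > 0:
                # The kernel silently picks the mechanism: block cloning,
                # server/driver offload, or an ordinary in-kernel copy.
                copied = os.copy_file_range(src.fileno(), dst.fileno(), remaining)
                if copied == 0:  # source shrank while we were copying
                    break
                remaining -= copied
        except (AttributeError, OSError):
            # Assumption: if the syscall is missing or refused, start over
            # with a plain user-space buffer copy.
            src.seek(0)
            dst.seek(0)
            dst.truncate()
            shutil.copyfileobj(src, dst)
    return os.path.getsize(dst_path)
```

Note that the caller cannot tell from the return value whether blocks were actually shared; that is exactly the difference from a "clone or fail" call.
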
> For reference, the macOS copyfile() call used for --clone has flags
> that should cause it to fail if it can't clone IIUC, while the Windows
> CopyFile() call used for --copy might even clone blocks on ReFS even
> if you don't specify --clone... huh.
>
>> I haven't had the possibility to check this on FreeBSD yet, but I don't
>> see why this should not work as I also can't spot anything in the
>> OpenZFS docs regarding reflink / block cloning limitations on FreeBSD.
>> Also I saw one of the OpenZFS devs writing on Reddit about block cloning
>> being supported on FreeBSD v14.
>
> It always succeeds on FreeBSD, but it only actually clones if you set
> vfs.zfs.bclone_enabled=1. I've tested all our "clone" features with
> that and they work nicely. The sysctl wasn't on by default in FreeBSD
> 14.x, but 15 is about to ship and the "experimental" label was removed
> in man 4 zfs.
>
> If you haven't seen them yet, you might also like these COW tricks:
>
> Shared storage of basic catalog tables when you have a lot of databases:
> SET file_copy_method = CLONE;
> CREATE DATABASE ... STRATEGY=FILE_COPY;
>
> Fast database clone/snapshot of very large databases (caveats: users
> can't be connected to source, checkpoint forced):
> SET file_copy_method = CLONE;
> CREATE DATABASE ... STRATEGY=FILE_COPY TEMPLATE=source_db;
>
> Combine a chain of incremental backups and a full backup to produce a
> new full backup, sharing disk blocks with the ancestor backups:
> pg_combinebackup --copy-file-range
>
> That last one is a really powerful use of copy_file_range()'s subfile
> cloning powers. Another subfile cloning trick I've proposed before is
> making relation segment size user-controllable, and then allowing
> pg_upgrade to migrate between segment sizes by splicing them together.
Oh, those are really handy commands, especially the last one, yes. Many
thanks for pointing these out!
> [1] https://github.com/openzfs/zfs/commit/9927f219f1e9f4ee886d426190500abf5b1d602e
> [2] https://github.com/openzfs/zfs/commit/4800181b3b950d67a62aca7c9e28d34c8b303242
[3] https://github.com/openzfs/zfs/pull/13392#issuecomment-1742172842
[4] https://openzfs.github.io/openzfs-docs/man/master/7/zpool-features.7.html#block_cloning
[5] https://openzfs.github.io/openzfs-docs/man/master/7/zfsconcepts.7.html#Block_cloning