Re: CREATE DATABASE with filesystem cloning

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CREATE DATABASE with filesystem cloning
Date: 2023-10-08 13:20:32
Message-ID: eb02dd00-3fba-9611-d2eb-b99b7c1723cf@dunslane.net
Lists: pgsql-hackers


On 2023-10-07 Sa 01:51, Thomas Munro wrote:
> Hello hackers,
>
> Here is an experimental POC of fast/cheap database cloning. For
> clones from little template databases, no one cares much, but it might
> be useful to be able to create a snapshot or fork of a very large
> database for testing/experimentation like this:
>
> create database foodb_snapshot20231007 template=foodb strategy=file_clone
>
> It should be a lot faster, and use less physical disk, than the two
> existing strategies on recent-ish XFS, BTRFS, very recent OpenZFS,
> APFS (= macOS), and it could in theory be extended to other systems
> that invented different system calls for this with more work (Solaris,
> Windows). Then extra physical disk space will be consumed only as the
> two clones diverge.
>
> It's just like the old strategy=file_copy, except it asks the OS to do
> its best copying trick. If you try it on a system that doesn't
> support copy-on-write, then copy_file_range() should fall back to
> plain old copy, but it might still be better than we could do, as it
> can push copy commands to network storage or physical storage.
>
> Therefore, the usual caveats from strategy=file_copy also apply here.
> Namely that it has to perform checkpoints which could be very
> expensive, and there are some quirks/brokenness about concurrent
> backups and PITR. Which makes me wonder if it's worth pursuing this
> idea. Thoughts?
>
> I tested on bleeding edge FreeBSD/ZFS, where you need to set sysctl
> vfs.zfs.bclone_enabled=1 to enable the optimisation, as it's still a
> very new feature that is still being rolled out. The system call
> succeeds either way, but that controls whether the new database
> initially shares blocks on disk, or gets new copies. I also tested on
> a Mac. In both cases I could clone large databases in a fraction of a
> second.

I've had to disable COW on my BTRFS-resident buildfarm animals (see
previous discussion re Direct I/O).
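
For anyone following along, the fallback behaviour described in the quoted
message (ask the OS for its best copying trick, and accept a plain copy when
cloning is unavailable) can be sketched roughly as below. This is an
illustration only, not code from the patch: the function name clone_or_copy
and the chunk parameter are hypothetical, and it assumes Linux with Python 3.8
or later, where os.copy_file_range() is available. Whether blocks end up
shared depends on the filesystem and its settings, so the return value only
reports which call path ran, not whether a clone happened.

```python
import errno
import os
import shutil


def clone_or_copy(src_path, dst_path, chunk=1 << 30):
    """Copy src_path to dst_path, preferring the kernel's copy_file_range().

    Returns "copy_file_range" if the system call handled the whole file, or
    "fallback" if an ordinary user-space copy was used instead.
    (Hypothetical helper for illustration; not part of PostgreSQL.)
    """
    try:
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            remaining = os.fstat(src.fileno()).st_size
            while remaining > 0:
                # The kernel may clone blocks, offload the copy to storage,
                # or copy the bytes itself; it may also copy fewer bytes
                # than requested, so loop until the file is done.
                n = os.copy_file_range(
                    src.fileno(), dst.fileno(), min(chunk, remaining)
                )
                if n == 0:
                    # Source shrank underneath us; treat as a failure.
                    raise OSError(errno.EIO, "unexpected end of file")
                remaining -= n
        return "copy_file_range"
    except (AttributeError, OSError):
        # AttributeError: this platform's Python has no os.copy_file_range.
        # OSError: the call is unsupported here (for example ENOSYS, EXDEV)
        # or failed part-way. Redo the whole copy the ordinary way; a
        # genuine I/O problem will surface again from shutil.copyfile().
        shutil.copyfile(src_path, dst_path)
        return "fallback"
```

Calling clone_or_copy("base/16384/16385", "base/99999/16385") with real paths
would copy one relation file; the paths here are made up. On a filesystem
with COW disabled, the same call should still succeed, just without any
sharing of blocks.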

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
