Re: CREATE DATABASE with filesystem cloning

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CREATE DATABASE with filesystem cloning
Date: 2024-03-06 02:16:38
Message-ID: CA+hUKGJycV7PBu_+6RVo=_r-aU81ag-o8JyhcKYm40dPNp5B+g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 11, 2023 at 7:40 PM Peter Eisentraut <peter(at)eisentraut(dot)org> wrote:
> On 07.10.23 07:51, Thomas Munro wrote:
> > Here is an experimental POC of fast/cheap database cloning.
>
> Here are some previous discussions of this:
>
> https://www.postgresql.org/message-id/flat/20131001223108.GG23410%40saarenmaa.fi
>
> https://www.postgresql.org/message-id/flat/511B5D11.4040507%40socialserve.com
>
> https://www.postgresql.org/message-id/flat/bc9ca382-b98d-0446-f699-8c5de2307ca7%402ndquadrant.com
>
> (I don't see any clear conclusions in any of these threads, but it might
> be good to check them in any case.)

Thanks. Wow, quite a lot of people have written an experimental patch
like this. I would say the things that have changed since then are:

* copy_file_range() became the preferred way to do this on Linux AFAIK
(instead of various raw ioctls)
* FreeBSD adopted Linux's copy_file_range()
* OpenZFS 2.2 implemented range-based cloning
* XFS enabled reflink support by default
* Apple introduced APFS, which supports cloning
* Several OSes adopted XFS, Btrfs, ZFS or APFS by default
* copy_file_range() went in the direction of not revealing how the
copying is done (no flags to force behaviour)

Here's a rebase.

The main thing that is missing is support for redo. I think it's
mostly trivial: probably just a record type for "try cloning first",
plus teaching the clone function to fall back to the regular copy path
if cloning fails in recovery. Do you agree with that idea? Another
approach would be to let it fail if cloning doesn't work on the
replica, so that you don't end up using dramatically different amounts
of disk space, but that seems terrible because now your replica is
broken. So probably fall back with a logged warning (?), though I'm
not sure exactly which errnos to give that treatment to.
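
To make the fallback idea concrete, here is a rough sketch. It is not
code from the attached patch. The names clone_or_copy(),
copy_by_read_write(), clone_errno_allows_fallback() and the in_recovery
flag are hypothetical, and the errno list is only a guess, since which
errnos deserve this treatment is exactly the open question. It assumes
Linux with glibc's copy_file_range() wrapper, and it prints the warning
to stderr where real code would use the server log.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * ASSUMPTION: errnos taken to mean "cloning is not available here".
 * This set is a guess for illustration only. Note that ENOSPC is
 * deliberately not in it: a plain copy would not help with that.
 */
static bool
clone_errno_allows_fallback(int err)
{
    return err == ENOSYS || err == EOPNOTSUPP ||
           err == EXDEV || err == EINVAL;
}

/* Regular copy path: read()/write() from the current offsets to EOF. */
static int
copy_by_read_write(int src_fd, int dst_fd)
{
    char        buf[8192];

    for (;;)
    {
        ssize_t     nread = read(src_fd, buf, sizeof(buf));
        ssize_t     done = 0;

        if (nread == 0)
            return 0;           /* end of source file */
        if (nread < 0)
        {
            if (errno == EINTR)
                continue;
            return -1;
        }
        while (done < nread)
        {
            ssize_t     nwritten = write(dst_fd, buf + done,
                                         (size_t) (nread - done));

            if (nwritten < 0)
            {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            done += nwritten;
        }
    }
}

/*
 * Try copy_file_range() first.  If it fails with a "not available"
 * errno while in recovery, warn and use the regular copy path instead.
 * With NULL offset pointers both descriptors advance together, so the
 * fallback simply resumes from wherever copy_file_range() stopped.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
clone_or_copy(int src_fd, int dst_fd, bool in_recovery, bool *used_fallback)
{
    int         save_errno;

    assert(src_fd >= 0 && dst_fd >= 0 && used_fallback != NULL);
    *used_fallback = false;

    for (;;)
    {
        ssize_t     n = copy_file_range(src_fd, NULL, dst_fd, NULL,
                                        (size_t) 1024 * 1024, 0);

        if (n == 0)
            return 0;           /* end of source file */
        if (n < 0 && errno != EINTR)
            break;              /* real failure, decide below */
    }

    save_errno = errno;
    if (!in_recovery || !clone_errno_allows_fallback(save_errno))
    {
        errno = save_errno;
        return -1;
    }

    fprintf(stderr, "WARNING: could not clone file, copying instead: %s\n",
            strerror(save_errno));
    *used_fallback = true;
    return copy_by_read_write(src_fd, dst_fd);
}
```

In this sketch a caller on the primary would pass in_recovery = false,
so a failed clone is reported as an error, while redo would pass true
and only fail for errnos outside the fallback set.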

One thing to highlight about COW file system semantics: PostgreSQL
behaves differently when space runs out. When writing relation data,
ZFS, for example, sometimes fails like bullet point 2 in this ENOSPC
article[1], while XFS usually fails like bullet point 1. A database on
XFS that has been cloned in this way would presumably start to fail
like bullet point 2, e.g. when checkpointing dirty pages, instead of
showing its usual extension-time-only
ENOSPC-rolls-back-your-transaction behaviour.

[1] https://wiki.postgresql.org/wiki/ENOSPC

Attachment: v3-0001-CREATE-DATABASE-.-STRATEGY-FILE_CLONE.patch (text/x-patch, 7.4 KB)
