Re: CREATE DATABASE with filesystem cloning

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CREATE DATABASE with filesystem cloning
Date: 2023-10-08 23:19:30
Message-ID: CA+hUKGKO=R=K5meb4+n2hmH6W=_Q01TT_89KPJS6BdDuJTJQdQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 9, 2023 at 2:20 AM Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> I've had to disable COW on my BTRFS-resident buildfarm animals (see
> previous discussion re Direct I/O).

Right, because it is still buggy[1]. I don't see any sign that a fix
has been committed yet, assuming that is the right thing (and it sure
sounds like it). It means you still have to disable COW to run the
004_io_direct.pl test for now, but that's an independent thing due
hopefully to be fixed soon, and you can still run PostgreSQL just fine
with COW enabled as it is by default as long as you don't turn on
debug_io_direct (which isn't for users yet anyway).

Since I hadn't actually tried out this cloning stuff out on
Linux/btrfs before and was claiming that it should work, I took it for
a quick unscientific spin (literally, this is on a spinning SATA disk
for extra crunchy slowness...). I created a scale 500 pgbench
database, saw that du -h showed 7.4G, and got these times:

postgres=# create database foodb_copy template=foodb strategy=file_copy;
CREATE DATABASE
Time: 124019.885 ms (02:04.020)
postgres=# create database foodb_clone template=foodb strategy=file_clone;
CREATE DATABASE
Time: 8618.195 ms (00:08.618)

That's something, but not as good as I was expecting, so let's also
try Linux/XFS for reference on the same spinny rust... One thing I
learned is that if you have an existing XFS partition, it might have
been created without reflinks enabled (see output of xfs_info) as that
was the default not very long ago and it's not changeable later, so on
the box I'm writing from I had to do a fresh mkfs.xfs to see any
benefit from this.

postgres=# create database foodb_copy template=foodb strategy=file_copy;
CREATE DATABASE
Time: 49157.876 ms (00:49.158)
postgres=# create database foodb_clone template=foodb strategy=file_clone;
CREATE DATABASE
Time: 1026.455 ms (00:01.026)

Not bad. To understand what that did, we can check which physical
blocks on disk hold the first segment of the pgbench_accounts table in
foodb and foodb_clone:

$ sudo xfs_bmap /mnt/xfs/pgdata/base/16384/16400
/mnt/xfs/pgdata/base/16384/16400:
0: [0..1637439]: 977586048..979223487
1: [1637440..2097151]: 1464966136..1465425847
$ sudo xfs_bmap /mnt/xfs/pgdata/base/16419/16400
/mnt/xfs/pgdata/base/16419/16400:
0: [0..1637439]: 977586048..979223487
1: [1637440..2097151]: 1464966136..1465425847

The same blocks. Now let's update a tuple on the second page of
pgbench_accounts in the clone:

foodb=# update pgbench_accounts set bid = bid + 1 where ctid = '(1, 1)';
UPDATE 1
foodb=# checkpoint;
CHECKPOINT

Now some new physical disk blocks have been allocated just for that
page, but the rest are still clones:

$ sudo xfs_bmap /mnt/xfs/pgdata/base/16419/16400
/mnt/xfs/pgdata/base/16419/16400:
0: [0..15]: 977586048..977586063
1: [16..31]: 977586064..977586079
2: [32..1637439]: 977586080..979223487
3: [1637440..2097151]: 1464966136..1465425847

I tried changing it to work in 1MB chunks and add the CFI() (v2
attached), and it didn't affect the time measurably and also didn't
generate any extra extents as displayed by xfs_bmap, so the end result
is the same. I haven't looked into the chunked version on the other
file systems yet.

I don't have the numbers to hand (different machines far from me right
now) but FreeBSD/ZFS and macOS/APFS were on the order of a few hundred
milliseconds for the same scale of pgbench on laptop storage (so not
comparable with the above).

I also tried a -s 5000 database, and saw that XFS could clone a 74GB
database just as fast as the 7.4GB database (still ~1s). At a guess,
this is going to scale not so much by total data size, but more by
things like number of relations, segment size and internal (invisible)
fragmentation due to previous cloning/update history in
filesystem-dependent ways, since those are the things that generate
extents (contiguous ranges of physical blocks to be referenced by the
new file).

[1] https://lore.kernel.org/linux-btrfs/ae81e48b0e954bae1c3451c0da1a24ae7146606c(dot)1676684984(dot)git(dot)boris(at)bur(dot)io/T/#u

Attachment Content-Type Size
v2-0001-WIP-CREATE-DATABASE-.-STRATEGY-FILE_CLONE.patch text/x-patch 10.2 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2023-10-08 23:26:24 Re: Problem, partition pruning for prepared statement with IS NULL clause.
Previous Message David E. Wheeler 2023-10-08 23:13:08 Re: JSON Path and GIN Questions