Re: pg_combinebackup --copy-file-range

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_combinebackup --copy-file-range
Date: 2024-03-31 04:33:56
Message-ID: add23637-03a2-4fd0-890b-5fa64bb2530a@enterprisedb.com
Lists: pgsql-hackers

On 3/31/24 03:03, Thomas Munro wrote:
> On Sun, Mar 31, 2024 at 1:37 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>> So I decided to take a stab at Thomas' idea, i.e. reading the data to
>> ...
>> I'll see how this works on EXT4/ZFS next ...
>
> Wow, very cool! A couple of very quick thoughts/notes:
>
> ZFS: the open source version only gained per-file block cloning in
> 2.2, so if you're on an older release I expect copy_file_range() to
> work but not really do the magic thing. On the FreeBSD version you
> also have to turn cloning on with a sysctl because people were worried
> about bugs in early versions so by default you still get actual
> copying, not sure if you need something like that on the Linux
> version... (Obviously ZFS is always completely COW-based, but before
> the new block cloning stuff it could only share blocks by its own
> magic deduplication if enabled, or by cloning/snapshotting a whole
> dataset/mountpoint; there wasn't a way to control it explicitly like
> this.)
>

I'm on 2.2.2 (on Linux), but something is wrong: the pg_combinebackup
run that took ~150s on XFS/BTRFS takes ~900s on ZFS.

I'm not sure it's a ZFS configuration issue, though, because the process
is neither CPU- nor I/O-bound, and I see this on both machines. Some
simple dd tests also show the zpool can do 10x that throughput. Could
this be due to the file header / pool alignment?

> Alignment: block sharing on any fs requires it. I haven't re-checked
> recently but IIRC the incremental file format might have a
> non-block-sized header? That means that if you copy_file_range() from
> both the older backup and also the incremental backup, only the former
> will share blocks, and the latter will silently be done by copying to
> newly allocated blocks. If that's still true, I'm not sure how hard
> it would be to tweak the format to inject some padding and to make
> sure that there isn't any extra header before each block.

I admit I'm not very familiar with the format, but you're probably right
that there's a header, and header_length does not seem to consider
alignment. make_incremental_rfile simply does this:

    /* Remember length of header. */
    rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
        sizeof(rf->truncation_block_length) +
        sizeof(BlockNumber) * rf->num_blocks;

and sendFile() does the same thing when creating an incremental
basebackup. I guess it wouldn't be too difficult to align this to BLCKSZ
or something similar. I wonder if the file format is documented somewhere
... It'd certainly be nicer to tweak the format before v18, if necessary.
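
Just to illustrate what I mean, here's a hypothetical sketch (not a
patch; the names simply mirror the existing code): round the computed
length up to the next BLCKSZ boundary, and have the writer pad the
header with zeroes up to that length, so the block data that follows
starts block-aligned:

    /*
     * Hypothetical sketch: round the header length up to the next
     * BLCKSZ boundary so the per-block data that follows starts
     * block-aligned. The writer would then emit (header_length -
     * raw_len) bytes of zero padding after the block-number array.
     */
    size_t      raw_len = sizeof(magic) + sizeof(rf->num_blocks) +
        sizeof(rf->truncation_block_length) +
        sizeof(BlockNumber) * rf->num_blocks;

    rf->header_length = ((raw_len + BLCKSZ - 1) / BLCKSZ) * BLCKSZ;

That would waste less than BLCKSZ per incremental file, which seems a
fair price for letting copy_file_range() share every block.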

Anyway, is that really a problem? I mean, in my tests the CoW stuff
seemed to work quite fine, at least on XFS/BTRFS. Although maybe that's
why it took longer on XFS ...
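
FWIW one way to check whether blocks actually got shared (rather than
silently copied) would be the FIEMAP ioctl on Linux; reflinked extents
come back with FIEMAP_EXTENT_SHARED set. A minimal standalone sketch,
assuming Linux with the usual kernel headers (works on XFS/BTRFS; I'm
not sure ZFS implements FIEMAP at all):

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int
    main(int argc, char **argv)
    {
        int         fd;
        struct fiemap *fm;
        unsigned int nextents = 256;    /* plenty for one file */

        if (argc != 2)
        {
            fprintf(stderr, "usage: %s FILE\n", argv[0]);
            return 1;
        }

        fd = open(argv[1], O_RDONLY);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        fm = calloc(1, sizeof(struct fiemap) +
                    nextents * sizeof(struct fiemap_extent));
        fm->fm_start = 0;
        fm->fm_length = FIEMAP_MAX_OFFSET;  /* map the whole file */
        fm->fm_extent_count = nextents;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
        {
            perror("FS_IOC_FIEMAP");
            return 1;
        }

        for (unsigned int i = 0; i < fm->fm_mapped_extents; i++)
        {
            struct fiemap_extent *e = &fm->fm_extents[i];

            printf("extent %u: logical %llu, length %llu, %s\n", i,
                   (unsigned long long) e->fe_logical,
                   (unsigned long long) e->fe_length,
                   (e->fe_flags & FIEMAP_EXTENT_SHARED) ?
                   "shared" : "not shared");
        }

        free(fm);
        close(fd);
        return 0;
    }

Running that on a reconstructed relation file should show whether the
data past the (unaligned) header got cloned or copied.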

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
