Re: pg_combinebackup --copy-file-range

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_combinebackup --copy-file-range
Date: 2024-03-31 00:37:25
Message-ID: 0e27835d-dab5-49cd-a3ea-52cf6d9ef59e@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/29/24 15:23, Robert Haas wrote:
> On Tue, Mar 26, 2024 at 2:03 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>> [ new patches ]
>
> Tomas, thanks for pointing me to this email; as you speculated, gmail
> breaks threading if the subject line changes.
>
> The documentation still needs work here:
>
> - It refers to --link mode, which is not a thing.
>
> - It should talk about the fact that in some cases block-by-block
> copying will still be needed, possibly mentioning that it specifically
> happens when the old backup manifest is not available or does not
> contain checksums or does not contain checksums of the right type, or
> maybe just being a bit vague.
>
> In copy_file.c:
>
> - You added an unnecessary blank line to the beginning of a comment block.
>

Thanks, should be all cleaned up now, I think.

> - You could keep the strategy_implementation variable here. I don't
> think that's 100% necessary, but now that you've got the code
> structured this way, there's no compelling reason to remove it.
>

Yeah, I think you're right. The strategy_implementation seemed a bit
weird to me because we now have 4 functions with different signatures.
Most only take srd/dst, but copy_file_blocks() also takes checksum. And
it seemed better to handle everything the same way, rather than treating
copy_file_blocks as an exception.

But it's not that bad, so 0001 has strategy_implementation again. But
I'll get back to this in a minute.

> - I don't know what the +/* XXX maybe this should do the check
> internally, same as the other functions? */ comment is talking about.
>

I think this is stale. The XXX is about how the various functions
detect/report support. In most we have the ifdefs/pg_fatal() inside the
function, but CopyFile() has nothing like that, because the detection
happens earlier. I wasn't sure if maybe we should make all these
functions more alike, but I don't think we should.

> - Maybe these functions should have header comments.
>

Right, added.

I was thinking about the comment [1] from a couple a days ago, where
Thomas suggested that maybe we should try doing the CoW stuff
(clone/copy_file_range) even in cases when we need to read the block,
say to calculate checksum, or even reconstruct from incremental backups.

I wasn't sure how big the benefits of the patches shared so far might
be, so I decided to do some tests. I did a fairly simple thing:

1) initialize a cluster with pgbench scale 5000 (~75GB)

2) create a full backup

3) do a run that updates ~1%, 10% and 20% of the blocks

4) create an incremental backup after each run

5) do pg_combinebackup for for each of the increments, with
block-by-block copy and copy_file_range, measure how long it takes and
how much disk space it consumes

I did this on xfs and btrfs, and it quickly became obvious that there's
very little benefit unless --no-manifest is used. Which makes perfect
sense, because pgbench is uniform updates so all segments need to be
reconstructed from increments (so copy_file.c is mostly irrelevant), and
write_reconstructed_file only uses copy_file_range() without checksums.

I don't know how common --no-manifest is going to be, but I guess people
will want to keep manifests in at least some backup schemes (e.g. to
rebuild full backup instead of having to take a full backup regularly).

So I decided to take a stab at Thomas' idea, i.e. reading the data to
calculate checksums, but then using copy_file_range instead of just
writing the data onto disk. This is in 0003, which relaxes the
conditions in 0002 shared a couple days ago. And this helped a lot.

The attached PDF shows results for xfs/btrfs. Charts on the left are
disk space occupied by the reconstructed backup, measured as difference
between "df" before and after running pg_combinebackup. The duration of
the pg_combinebackup execution is on the right. First row is without
manifest (i.e. --no-manifest), the second row is with manifest.

The 1%, 10% and 20% groups are for the various increments, updating
different fractions of the database.

The database is ~75GB, so that's what we expect a plain copy to have. If
there are some CoW benefits of copy_file_range, allowing the fs to reuse
some of the space or, the disk space should be reduced. And similarly,
there could/should be some improvements in pg_combinebackup duration.

Each bar is a different copy method and patch:

* copy on master/0001/0002/0003 - we don't expect any difference between
these, it should all perform the same and use the "full" space

* copy_file_range on 0001/0002/0003 - 0001 should perform the same as
copy, because it's only about full-segment copies, and we don't any of
those here, 0002/0003 should help, depending on --no-manifest

And indeed, this is what we see. 0002/0003 use only a fraction of disk
space, roughly the same as the updated fraction (which matches the size
of the increment). This is nice.

For duration, the benefits seem to depend on the file system. For btrfs
it actually is faster, as expected. 0002/0003 saves maybe 30-50% of
time, compared to block-by-block copy. On XFS it's not that great, the
copy_file_range is actually slower by up to about 50%. And this is not
about the extra read - this affects the 0002/no-manifest case too, where
the read is not necessary.

I think this is fine. It's a tradeoff, where on some filesystems you can
save time or space, and on other filesystems you can save both. That's a
tradeoff for the users to decide, I think.

I'll see how this works on EXT4/ZFS next ...

But thinking about this a bit more, I realized there's no reason not to
apply the same logic to the copy_file part. I mean, if we need to copy a
file but also calculate a checksum, we can simply do the clone using
clone/copy_file_range, but then also read the file and calculate the
checksum ...

0004 does this, by simply passing the checksum_cxt around, which also
has the nice consequence that all the functions now have the same
signature, which makes the strategy_implementation work for all cases. I
need to do more testing of this, but like how this looks.

Of course, maybe there's not an agreement this is the right way to
approach this, and we should do the regular block-by-block copy?

There's one more change in 0003 - the checks if clone/copy_file_range
are supported by the platform now happen right at the beginning when
parsing the arguments, so that when a user specifies one of those
options, the error happens right away instead of sometime later when we
happen to hit one of those pg_fatal() places.

I think this is the right place to do these checks, as it makes the
write_reconstructed_file much easier to understand (without all the
ifdefs etc.).

But there's an argument whether this should fail with pg_fatal() or just
fallback to the default copy method.

BTW I wonder if it makes sense to only allow one of those methods? At
the moment the user can specify both --clone and --copy-file-range, and
which one "wins" depends on the order in which they are specified. Seems
confusing at best. But maybe it'd make sense to allow both, and e.g. use
clone() to copy whole segments and copy_file_range() for other places?

regards

[1]
https://www.postgresql.org/message-id/CA%2BhUKG%2B8KDk%2BpM6vZHWT6XtZzh-sdieUDohcjj0fia6aqK3Oxg%40mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
incremental-backup-test.pdf application/pdf 422.3 KB
v20240330-0001-pg_combinebackup-allow-using-clone-copy_fi.patch text/x-patch 14.3 KB
v20240330-0002-write_reconstructed_file.patch text/x-patch 2.4 KB
v20240330-0003-extend-use-of-copy_file_range.patch text/x-patch 4.9 KB
v20240330-0004-allow-cloning-with-checksum-calculation.patch text/x-patch 6.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dmitry Koval 2024-03-31 00:56:50 Re: Add SPLIT PARTITION/MERGE PARTITIONS commands
Previous Message Tom Lane 2024-03-31 00:15:45 Re: Security lessons from liblzma