Re: pg_combinebackup --copy-file-range

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_combinebackup --copy-file-range
Date: 2024-04-03 22:56:37
Message-ID: d9d115a7-c5e2-4258-9fc3-7514ab33f373@enterprisedb.com
Lists: pgsql-hackers

Hi,

Here's a much more polished and cleaned-up version of the patches,
fixing all the issues I've been aware of, and with the various pieces
merged into more cohesive parts (instead of keeping them separate to
make the changes/evolution easier to follow).

I decided to reorder the changes like this:

1) block alignment - As I said earlier, I think we should definitely do
this, even if only to make future improvements possible. After chatting
about this with Robert off-list a bit, he pointed out that I had
forgotten that headers of files without any blocks should not be
aligned, so this version fixes that.
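
For illustration, here's a minimal sketch of the alignment rule - not
the actual patch, the function name and the exact header size are made
up - just to show the intent: pad the header of each incremental file
up to a BLCKSZ boundary so the block data that follows is block-aligned,
and skip the padding for files without any blocks.

/*
 * Minimal sketch, not the actual patch: pad the incremental file header
 * to the next BLCKSZ boundary so the block data that follows starts
 * block-aligned, but skip the padding for files that contain no blocks.
 */
#include <stdio.h>

#define BLCKSZ 8192

static size_t
aligned_header_size(size_t header_len, unsigned num_blocks)
{
    if (num_blocks == 0)
        return header_len;    /* no blocks, nothing to align */

    /* round up to the next multiple of BLCKSZ */
    return ((header_len + BLCKSZ - 1) / BLCKSZ) * BLCKSZ;
}

int
main(void)
{
    printf("%zu\n", aligned_header_size(24, 16));    /* 8192 */
    printf("%zu\n", aligned_header_size(24, 0));     /* 24 */
    return 0;
}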

2) add clone/copy_file_range for the case that copies whole files. This
is pretty simple, but it also adds the --clone/--copy-file-range options
and the related infrastructure. The one slightly annoying bit is that we
now have the ifdef stuff in two places - when parsing the options, and
then in the copy_file_XXX methods - and the pg_fatal() calls should not
be reachable in practice. But that doesn't seem harmful, and it might be
a useful protection against someone calling a function that would
otherwise do nothing.
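
For illustration, this is roughly what that pattern looks like - a
hypothetical sketch, not the code from the patch (the function name is
made up, only the Linux FICLONE path is shown, and it assumes the usual
frontend environment where common/logging.h provides pg_fatal):

/*
 * Hypothetical sketch of the pattern described above, not the actual
 * pg_combinebackup code: the copy method is guarded by the same #ifdef
 * as the option parsing, so the pg_fatal() branch should be unreachable
 * in practice.
 */
#include <fcntl.h>
#include <unistd.h>
#ifdef __linux__
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FICLONE */
#endif

#include "common/logging.h"     /* pg_fatal */

static void
copy_file_clone(const char *src, const char *dst)
{
#ifdef FICLONE
    int         src_fd = open(src, O_RDONLY);
    int         dst_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (src_fd < 0 || dst_fd < 0)
        pg_fatal("could not open \"%s\" or \"%s\": %m", src, dst);

    if (ioctl(dst_fd, FICLONE, src_fd) < 0)
        pg_fatal("could not clone \"%s\" to \"%s\": %m", src, dst);

    close(src_fd);
    close(dst_fd);
#else
    /* should be unreachable, --clone was rejected when parsing options */
    pg_fatal("file cloning is not supported on this platform");
#endif
}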

Patch (2) also merges the original two parts, where the first one only
did this cloning/CoW stuff when the checksum did not need to be
calculated, and the second extended it to those cases too (by also
reading the data, but still doing the copy the old way).

I think this is the right way - if that's not desirable, it's easy to
either add --no-manifest or not use the CoW options, or simply not
change the checksum type. There's no other way around it.

3) add copy_file_range to write_reconstructed_file, by using roughly the
same logic/reasoning as (2). If --copy-file-range is specified and a
checksum should be calculated, the data is read for the checksum, but
the copy is done using copy_file_range.

I did rework the write_reconstructed_file() flow a bit, because tracking
what exactly needs to be read/written in each of the cases (which copy
method, zeroed block, checksum calculated) made the flow really
difficult to follow. Instead, I introduced helper functions to read and
write a block, and call them from the different places.

I think this is much better, and it also makes the following part,
dealing with batches of blocks, a much easier / smaller change.
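
A rough sketch of the per-block logic described above - the function
name and signature are mine, not the patch's, error handling is
simplified (a real version would retry short reads/writes/copies), and
it assumes the pg_combinebackup frontend environment on Linux:

/*
 * Rough sketch, not the actual code: read the block only when we need
 * it (for the checksum, or because we write it the old way), and copy
 * it either via copy_file_range() or a plain pwrite().
 */
#include "postgres_fe.h"

#include <unistd.h>

#include "common/checksum_helper.h"
#include "common/logging.h"

static void
copy_one_block(int src_fd, off_t src_off, int dst_fd, off_t dst_off,
               bool use_copy_file_range,
               pg_checksum_context *checksum_ctx)
{
    char        buf[BLCKSZ];

    /* read the block if we compute a checksum or write it the old way */
    if (checksum_ctx != NULL || !use_copy_file_range)
    {
        if (pread(src_fd, buf, BLCKSZ, src_off) != BLCKSZ)
            pg_fatal("could not read block: %m");

        if (checksum_ctx != NULL &&
            pg_checksum_update(checksum_ctx, (uint8 *) buf, BLCKSZ) < 0)
            pg_fatal("could not update checksum");
    }

    if (use_copy_file_range)
    {
        /* CoW copy, even if we read the block for the checksum above */
        if (copy_file_range(src_fd, &src_off, dst_fd, &dst_off,
                            BLCKSZ, 0) != BLCKSZ)
            pg_fatal("could not copy block: %m");
    }
    else
    {
        /* the regular copy - write out the buffer we just read */
        if (pwrite(dst_fd, buf, BLCKSZ, dst_off) != BLCKSZ)
            pg_fatal("could not write block: %m");
    }
}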

4) prefetching - This is mostly unrelated to the CoW stuff, but it has
tremendous benefits, especially for ZFS. I haven't been able to tune ZFS
to get decent performance, and ISTM it'd be rather unusable for backup
purposes without this.
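
The mechanism itself is just posix_fadvise(POSIX_FADV_WILLNEED) issued
some fixed distance ahead of the block we're about to read - roughly
like this (a simplified sketch, not the patch; 128 is the hardcoded
distance discussed further down):

/*
 * Simplified sketch of the prefetching: advise the kernel/filesystem
 * that we will soon need a block some distance ahead of the one we are
 * about to read.
 */
#include <sys/types.h>
#include <fcntl.h>

#define BLCKSZ              8192
#define PREFETCH_DISTANCE   128     /* blocks to stay ahead */

static void
prefetch_block(int fd, unsigned blkno, unsigned nblocks)
{
    unsigned    target = blkno + PREFETCH_DISTANCE;

    if (target < nblocks)
        (void) posix_fadvise(fd, (off_t) target * BLCKSZ, BLCKSZ,
                             POSIX_FADV_WILLNEED);
}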

5) batching in write_reconstructed_file - This especially benefits the
copy_file_range case, where I've seen it yield massive speedups (see the
message from Monday for better data).
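
For illustration, a sketch of what the batching means for the
copy_file_range case (names are mine, not the patch's): consecutive
blocks coming from the same source file get merged into one run and
copied with a single call, instead of one call per 8kB block. The
caller only needs to detect how long the run of consecutive blocks is
and then call something like this once:

/*
 * Sketch of the batched copy, not the actual patch: copy a run of
 * consecutive blocks from the same source file with one (or a few)
 * copy_file_range() calls instead of one call per block.
 */
#include "postgres_fe.h"

#include <unistd.h>

#include "common/logging.h"

static void
copy_block_range(int src_fd, off_t src_off, int dst_fd, off_t dst_off,
                 unsigned nblocks)
{
    size_t      remaining = (size_t) nblocks * BLCKSZ;

    while (remaining > 0)
    {
        ssize_t     nbytes = copy_file_range(src_fd, &src_off,
                                             dst_fd, &dst_off,
                                             remaining, 0);

        if (nbytes <= 0)
            pg_fatal("could not copy block range: %m");

        remaining -= nbytes;    /* the offsets were advanced by the call */
    }
}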

6) batching for prefetch - Similar logic to (5), but for fadvise. This
is the only part where I'm not entirely sure whether it actually helps
or not. Needs some more analysis, but I'm including it for completeness.
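
The analogous sketch for the batched prefetch (again, just an
illustration): a contiguous run of blocks gets a single
POSIX_FADV_WILLNEED call covering the whole range, instead of one call
per block.

/*
 * Sketch of the batched prefetch, not the actual patch.
 */
#include <sys/types.h>
#include <fcntl.h>

#define BLCKSZ 8192

static void
prefetch_block_range(int fd, unsigned first_blkno, unsigned nblocks)
{
    (void) posix_fadvise(fd, (off_t) first_blkno * BLCKSZ,
                         (off_t) nblocks * BLCKSZ, POSIX_FADV_WILLNEED);
}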

I do think parts (1)-(4) are in pretty good shape, good enough to make
them committable in a day or two. I see it mostly as a matter of testing
and comment improvements rather than code changes.

(5) is in pretty good shape too, but I'd like to maybe simplify and
refine the write_reconstructed_file changes a bit more. I don't think
it's broken, but it feels a bit too cumbersome.

Not sure about (6) yet.

I changed how I think about this a bit - I don't really see the CoW
copy methods as necessarily faster than the regular copy (even though
they can be, with (5)). I think the main benefits are the space savings,
enabled by patches (1)-(3). If (4) and (5) get in, that's a bonus, but
even without them I don't think the performance is an issue - everything
has a cost.

On 4/3/24 15:39, Jakub Wartak wrote:
> On Mon, Apr 1, 2024 at 9:46 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> Hi,
>>
>> I've been running some benchmarks and experimenting with various stuff,
>> trying to improve the poor performance on ZFS, and the regression on XFS
>> when using copy_file_range. And oh boy, did I find interesting stuff ...
>
> [..]
>
> Congratulations on great results!
>
>> 4) after each pg_combinebackup run to pg_verifybackup, start the cluster
>> to finish recovery, run pg_checksums --check (to check the patches don't
>> produce something broken)
>
> I've performed some follow-up small testing on all patches mentioned
> here (1..7), with the earlier developed nano-incremental-backup-tests
> that helped detect some issues for Robert earlier during original
> development. They all went fine in both cases:
> - no special options when using pg_combinebackup
> - using pg_combinebackup --copy-file-range --manifest-checksums=NONE
>
> Those were:
> test_across_wallevelminimal.sh
> test_full_pri__incr_stby__restore_on_pri.sh
> test_full_pri__incr_stby__restore_on_stby.sh
> test_full_stby__incr_stby__restore_on_pri.sh
> test_full_stby__incr_stby__restore_on_stby.sh
> test_incr_after_timelineincrease.sh
> test_incr_on_standby_after_promote.sh
> test_many_incrementals_dbcreate_duplicateOID.sh
> test_many_incrementals_dbcreate_filecopy_NOINCR.sh
> test_many_incrementals_dbcreate_filecopy.sh
> test_many_incrementals_dbcreate.sh
> test_many_incrementals.sh
> test_multixact.sh
> test_pending_2pc.sh
> test_reindex_and_vacuum_full.sh
> test_repro_assert_RP.sh
> test_repro_assert.sh
> test_standby_incr_just_backup.sh
> test_stuck_walsum.sh
> test_truncaterollback.sh
> test_unlogged_table.sh
>
>> Now to the findings ....
>>

Thanks. It would be great if you could run this on the attached version
of the patches, ideally for each of them independently, to make sure it
doesn't get broken and then fixed somewhere along the way.

>>
>> 1) block alignment
>
> [..]
>
>> And I think we probably want to do this now, because this affects all
>> tools dealing with incremental backups - even if someone writes a custom
>> version of pg_combinebackup, it will have to deal with misaligned data.
>> Perhaps there might be something like pg_basebackup that "transforms"
>> the data received from the server (and also the backup manifest), but
>> that does not seem like a great direction.
>
> If anything is on the table, then I think in the far future
> pg_refresh_standby_using_incremental_backup_from_primary would be the
> only other tool using the format ?
>

Possibly, but I was thinking more about backup solutions using the same
format, but doing the client side differently. Essentially, something
that would still use the server side to generate incremental backups,
but replace pg_combinebackup and handle the data differently (stream it
somewhere else, index it somehow, or whatever).

>> 2) prefetch
>> -----------
> [..]
>> I think this means we may need a "--prefetch" option, that'd force
>> prefetching, probably both before pread and copy_file_range. Otherwise
>> people on ZFS are doomed and will have poor performance.
>
> Right, we could optionally cover in the docs later-on various options
> to get the performance (on XFS use $this, but without $that and so
> on). It's kind of madness dealing with all those performance
> variations.
>

Yeah, something like that. I'm not sure we want to talk too much about
individual filesystems in our docs, because those things evolve over
time too. And also this depends on how large the increment is. If it
only modifies 1% of the blocks, then 99% will come from the full backup,
and the sequential prefetch should do OK (well, maybe not on ZFS). But
as the incremental backup gets larger / is more random, the prefetch is
more and more important.

> Another idea: remove that 128 posix_fadvise() hardcode in 0002 and add
> a getopt variant like: --prefetch[=HOWMANY] with 128 being the default?
>

I did think about that, but there's a dependency on the batching. If
we're reading ~1MB of data at a time, we may need to prefetch up to ~1MB
ahead, because otherwise we might want to read 1MB and find that only a
tiny part of it had been prefetched (with 8kB blocks, a 1MB batch is 128
blocks). I was thinking maybe we could skip prefetching the sequential
parts, but ZFS does need that.

So I don't think we can just allow users to set arbitrary values, at
least not without also tweaking the batch. Or maybe 1MB batches are too
large, and we should use something smaller? I need to think about this a
bit more ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
v20240403-0001-Properly-align-blocks-in-incremental-backu.patch text/x-patch 6.6 KB
v20240403-0002-Allow-copying-files-using-clone-copy_file_.patch text/x-patch 17.1 KB
v20240403-0003-Use-copy_file_range-in-write_reconstructed.patch text/x-patch 6.3 KB
v20240403-0004-Prefetch-blocks-read-by-write_reconstructe.patch text/x-patch 10.4 KB
v20240403-0005-Try-copying-larger-chunks-of-data-from-the.patch text/x-patch 8.0 KB
v20240403-0006-Try-prefetching-larger-chunks-of-data-from.patch text/x-patch 3.7 KB
