Re: pg_combinebackup --copy-file-range

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_combinebackup --copy-file-range
Date: 2024-04-01 19:43:06
Message-ID: 5ac425a4-6201-4f24-912e-8eed6905790a@enterprisedb.com
Lists: pgsql-hackers

Hi,

I've been running some benchmarks and experimenting with various stuff,
trying to improve the poor performance on ZFS, and the regression on XFS
when using copy_file_range. And oh boy, did I find interesting stuff ...

Attached is a PDF with results of my benchmark for ZFS/XFS/BTRFS, on my
two machines. I already briefly described what the benchmark does, but
to clarify:

1) generate backups: initialize pgbench at scale 5000, do a full
backup, update roughly 1%, 10% and 20% of blocks, and do an incremental
backup after each of those steps

2) combine backups: full + 1%, full + 1% + 10%, full + 1% + 10% + 20%

3) measure how long it takes and how much more disk space is used (to
see how well the CoW stuff works)

4) after each pg_combinebackup run, do pg_verifybackup, start the
cluster to finish recovery, and run pg_checksums --check (to check the
patches don't produce something broken)

There's a lot of interesting stuff to discuss, some of which was already
mentioned in this thread earlier - in particular, I want to talk about
block alignment, prefetching and processing larger chunks of blocks.

Also attached are all the patches, including the ugly WIP parts
discussed later, the complete results in case you want to do your own
analysis, and the scripts used to generate/restore the backups.

FWIW I'm not claiming the patches are commit-ready (especially the new
WIP parts), but they should be correct and good enough for discussion
(that applies especially to 0007). I think I could get them ready in a
day or two, but I'd like some feedback on my findings, and if someone
objects to getting this in so shortly before the feature freeze, I'd
prefer to know about that.

The patches are numbered the same as in the benchmark results, i.e. 0001
is "1", 0002 is "2" etc. The "0-baseline" option is current master
without any patches.

Now to the findings ....

1) block alignment
------------------

This was mentioned by Thomas a couple of days ago, when he pointed out
that the incremental files have a variable-length header (to record
which blocks are stored in the file), followed by the block data, which
means the block data is not aligned to the filesystem block size. I
hadn't realized this - I just used whatever the reconstruction function
received - but Thomas pointed out this may interfere with CoW, which
needs the blocks to be aligned.

And I think he's right, and my tests confirm this. I did a trivial patch
to align the blocks to an 8K boundary, by forcing the header to be a
multiple of 8K (I think 4K alignment would be enough). See the 0001
patch that does this.
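
To illustrate the idea (just a sketch with names made up here, not the
actual 0001 code), the padding is simply the distance from the end of
the variable-length header to the next 8K boundary:

#include <stddef.h>

/* assumed alignment target - the patch pads the header to 8K */
#define HEADER_ALIGNMENT 8192

/*
 * Number of zero bytes to append after a variable-length header of
 * header_len bytes, so the block data following it starts on an 8K
 * boundary.
 */
static size_t
header_padding(size_t header_len)
{
    size_t  remainder = header_len % HEADER_ALIGNMENT;

    return (remainder == 0) ? 0 : HEADER_ALIGNMENT - remainder;
}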

And if I measure the disk space used by pg_combinebackup, and compare
the results with results without the patch ([1] from a couple days
back), I see this:

  pct      not aligned      aligned
  ----------------------------------
   1%             689M          19M
  10%            3172M          22M
  20%           13797M          27M

Yes, those numbers are correct. I didn't believe this at first, but the
backups are valid/verified, checksums are OK, etc. BTRFS has similar
numbers (e.g. drop from 20GB to 600MB).

If you look at the charts in the PDF, the charts for on-disk space are
on the right side. It might seem like copy_file_range/CoW has no
impact, but that's just an illusion - the bars for the last three cases
are so small they're difficult to see (especially on XFS). These charts
don't show the impact of alignment itself (all the cases in these runs
have aligned blocks), but they do show how tiny the backups can be
made - and per the numbers above, the alignment makes a significant
difference to that.

This also affects the prefetching, which I'm going to talk about next.
Having the blocks misaligned (spanning multiple 4K pages) forces the
system to prefetch more pages than necessary. I don't know how big the
impact is, because the prefetch patch is 0002 and I only have results
for prefetching on aligned blocks, but I don't see how it could not
have a cost.

I do think we should just align the blocks properly. The 0001 patch does
that simply by adding a bunch of \0 bytes up to the next 8K boundary.
Yes, this has a cost - if you have tiny files with only one or two
blocks changed, the incremental file will be a bit larger. Files without
any blocks don't need alignment/padding, and as the number of blocks
increases, it gets negligible pretty quickly. Also, files use a multiple
of fs blocks anyway, so if we align to 4K blocks it wouldn't actually
need more space at all. And even if it does, it's all \0, so pretty damn
compressible (and I'm sorry, but if you care about tiny amounts of data
added by alignment, but refuse to use compression ...).

I think we absolutely need to align the blocks in the incremental files,
and I think we should do that now. I think 8K would work, but maybe we
should add an alignment parameter to basebackup & manifest?

The reason I think this should maybe be a basebackup parameter is the
recent discussion about larger fs blocks - that seems to be in the
works, so it may be better to be ready and not assume all filesystems
use 4K blocks.

And I think we probably want to do this now, because this affects all
tools dealing with incremental backups - even if someone writes a custom
version of pg_combinebackup, it will have to deal with misaligned data.
Perhaps there might be something like pg_basebackup that "transforms"
the data received from the server (and also the backup manifest), but
that does not seem like a great direction.

Note: Of course, these space savings only exist thanks to sharing blocks
with the input backups, because the blocks in the combined backup point
to one of the other backups. If those old backups are removed, then the
"saved space" disappears because there's only a single copy.

2) prefetch
-----------

I was very puzzled by the awful performance on ZFS. While every other fs
(EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
900-1000 seconds on ZFS, no matter what I did. I tried all the tuning
advice I could think of, with almost no effect.

Ultimately I decided it's probably the "no readahead" behavior I've
observed on ZFS. I assume that's because ZFS doesn't use the page
cache, where the regular readahead is detected etc. And there's no
prefetching in pg_combinebackup, so I decided to do an experiment and
added a trivial explicit prefetch when reconstructing a file - every
time we read data from a file, we do posix_fadvise for up to 128 blocks
ahead (similar to what the bitmap heap scan code does). See 0002.
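
Roughly speaking, the prefetch call looks like this (a simplified
sketch, not the exact 0002 code - prefetch_blocks and the constants are
placeholder names):

#include <fcntl.h>

#define BLOCK_SIZE      8192    /* PostgreSQL block size */
#define PREFETCH_WINDOW  128    /* prefetch at most 128 blocks ahead */

/*
 * Hint the kernel to start reading nblocks blocks beginning at
 * first_block in the given file.  POSIX_FADV_WILLNEED is only a hint,
 * so the return value can be ignored.
 */
static void
prefetch_blocks(int fd, off_t first_block, off_t nblocks)
{
    if (nblocks > PREFETCH_WINDOW)
        nblocks = PREFETCH_WINDOW;

    if (nblocks > 0)
        (void) posix_fadvise(fd, first_block * BLOCK_SIZE,
                             nblocks * BLOCK_SIZE, POSIX_FADV_WILLNEED);
}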

And tadaaa - the duration dropped from 900-1000 seconds to only about
250-300 seconds, so an improvement of a factor of 3-4x. I think this is
pretty massive.

There are a couple more interesting ZFS details - the prefetching seems
to be necessary even when using copy_file_range() and we don't need to
read the data (to calculate checksums). This is why the "manifest=off"
chart has the strange group of high bars at the end - the copy cases
are fast because prefetching happens, but if we switch to
copy_file_range() there are no prefetches and it gets slow.

This is a bit bizarre, especially because the manifest=on cases are
still fast, exactly because the pread + prefetching still happens. I'm
sure users would find this puzzling.

Unfortunately, the prefetching is not beneficial for all filesystems.
For XFS it does not seem to make any difference, but on BTRFS it seems
to cause a regression.

I think this means we may need a "--prefetch" option, that'd force
prefetching, probably both before pread and copy_file_range. Otherwise
people on ZFS are doomed and will have poor performance.

3) bulk operations
------------------

Another thing suggested by Thomas last week was that maybe we should try
detecting longer runs of blocks coming from the same file, and operate
on them as a single chunk of data. If you see e.g. 32 blocks, instead of
doing read/write or copy_file_range for each of them, we could simply do
one call for all those blocks at once.
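
To make that concrete, a batched copy of one such run could look
roughly like this (a sketch that assumes the run detection already
happened; copy_block_run is a placeholder name, not the actual 0007
code):

#define _GNU_SOURCE
#include <unistd.h>

#define BLOCK_SIZE 8192

/*
 * Copy nblocks consecutive blocks, starting at first_block, from the
 * source file to the same offsets in the output file, using a single
 * copy_file_range() call per run instead of one call per block.  Loop
 * because copy_file_range() may copy less than requested.
 */
static int
copy_block_run(int src_fd, int dst_fd, off_t first_block, off_t nblocks)
{
    off_t   off_in = first_block * BLOCK_SIZE;
    off_t   off_out = off_in;
    size_t  remaining = (size_t) nblocks * BLOCK_SIZE;

    while (remaining > 0)
    {
        ssize_t copied = copy_file_range(src_fd, &off_in,
                                         dst_fd, &off_out,
                                         remaining, 0);

        if (copied <= 0)
            return -1;          /* caller falls back to read/write */

        remaining -= (size_t) copied;
    }

    return 0;
}

A real version would cap the run length (0007 uses 128 blocks, i.e.
1MB) and fall back to the block-by-block path when copy_file_range()
is not usable.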

I think such runs are pretty likely, especially for small incremental
backups where most of the blocks will come from the full backup. And I
suspected the XFS regression (where copy_file_range was up to 30-50%
slower in some cases, see [1]) is related to this, because the perf
profiles had stuff like this:

97.28% 2.10% pg_combinebacku [kernel.vmlinux] [k]
|
|--95.18%--entry_SYSCALL_64
| |
| --94.99%--do_syscall_64
| |
| |--74.13%--__do_sys_copy_file_range
| | |
| | --73.72%--vfs_copy_file_range
| | |
| | --73.14%--xfs_file_remap_range
| | |
| | |--70.65%--xfs_reflink_remap_blocks
| | | |
| | | --69.86%--xfs_reflink_remap_extent

So I took a stab at this in 0007, which detects runs of blocks coming
from the same source file (limited to 128 blocks, i.e. 1MB). In 0007 I
only did this for the copy_file_range() calls, and the results for XFS
look like this (complete results are in the PDF):

        old (block-by-block)    new (batches)
  -----------------------------------------------
   1%                   150s               4s
  10%               150-200s              46s
  20%               150-200s              65s

Yes, once again, those results are real, the backups are valid etc. So
not only does it take much less space (thanks to block alignment), it
also takes much less time (thanks to bulk operations).

The cases with "manifest=on" improve too, but not nearly as much. I
believe this is simply because the read/write still happens block by
block. But it shouldn't be difficult to do that in a bulk manner too
(we already have the range detected, but I was lazy).
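
For completeness, a bulk read/write variant could look roughly like
this (again just a sketch with placeholder names, ignoring short
reads/writes and the checksum update the real code would do on the
buffer):

#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE 8192

/*
 * Read a run of nblocks consecutive blocks with one pread() and write
 * them out with one pwrite(), instead of one syscall pair per block.
 */
static int
copy_block_run_readwrite(int src_fd, int dst_fd, off_t first_block,
                         off_t nblocks)
{
    size_t  len = (size_t) nblocks * BLOCK_SIZE;
    off_t   off = first_block * BLOCK_SIZE;
    char   *buf = malloc(len);
    int     ret = -1;

    if (buf == NULL)
        return -1;

    if (pread(src_fd, buf, len, off) == (ssize_t) len &&
        pwrite(dst_fd, buf, len, off) == (ssize_t) len)
        ret = 0;                /* checksum update would go here */

    free(buf);
    return ret;
}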

[1]
https://www.postgresql.org/message-id/0e27835d-dab5-49cd-a3ea-52cf6d9ef59e%40enterprisedb.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
v20240401-0001-WIP-block-alignment.patch text/x-patch 3.0 KB
v20240401-0002-WIP-prefetch-blocks-when-reconstructing-fi.patch text/x-patch 1.5 KB
v20240401-0003-use-clone-copy_file_range-to-copy-whole-fi.patch text/x-patch 14.4 KB
v20240401-0004-use-copy_file_range-in-write_reconstructed.patch text/x-patch 2.5 KB
v20240401-0005-use-copy_file_range-with-checksums.patch text/x-patch 5.3 KB
v20240401-0006-allow-cloning-with-checksum-calculation.patch text/x-patch 6.5 KB
v20240401-0007-WIP-copy-larger-chunks-from-the-same-file.patch text/x-patch 7.2 KB
xeon-nvme-xfs.csv text/csv 4.9 KB
i5-ssd-zfs.csv text/csv 4.7 KB
xeon-nvme-btrfs.csv text/csv 4.9 KB
benchmark-results.pdf application/pdf 365.8 KB
generate-backups.sh application/x-shellscript 2.4 KB
restore-backups.sh application/x-shellscript 5.2 KB
