Re: pg_combinebackup --copy-file-range

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_combinebackup --copy-file-range
Date: 2024-03-26 18:02:59
Message-ID: 90866c27-265a-4adb-89d0-18c8dd22bc19@enterprisedb.com
Lists: pgsql-hackers

On 3/26/24 15:09, Jakub Wartak wrote:
> On Sat, Mar 23, 2024 at 6:57 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
>> On 3/23/24 14:47, Tomas Vondra wrote:
>>> On 3/23/24 13:38, Robert Haas wrote:
>>>> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> [..]
>>> Yeah, that's in write_reconstructed_file() and the patch does not touch
>>> that at all. I agree it would be nice to use copy_file_range() in this
>>> part too, and it doesn't seem it'd be that hard to do, I think.
>>>
>>> It seems we'd just need a "fork" that either calls pread/pwrite or
>>> copy_file_range, depending on checksums and what was requested.
>>>
>>
>> Here's a patch to use copy_file_range in write_reconstructed_file too,
>> when requested/possible. One thing that I'm not sure about is whether to
>> do pg_fatal() if --copy-file-range is requested but the platform does not
>> support it.
> [..]
>
> Hi Tomas, so I gave the below patches a go today:
> - v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch
> - v20240323-2-0002-write_reconstructed_file.patch
>
> My assessment:
>
> v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch -
> looks more or less good to go

There are some issues with --dry-run, pointed out by Robert. They should be
fixed in the attached version.

> v20240323-2-0002-write_reconstructed_file.patch - needs work, and
> without it the good effects of clone/copy_file_range() are unlikely
>
> Given Debian 12, ~100GB DB (pgbench -i -s 7000, and some additional
> tables with GiST and GIN indexes, just to see more WAL record types)
> and with backup sizes in MB like that:
>
> 106831 full
> 2823 incr.1 # captured after some time with pgbench -R 100
> 165 incr.2 # captured after some time with pgbench -R 100
>
> Test cmd: rm -rf outtest; sync; sync ; sync; echo 3 | sudo tee
> /proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup -o
> outtest full incr.1 incr.2
>
> Test results of various copies on small I/O constrained XFS device:
> normal copy: 31m47.407s
> --clone copy: error: file cloning not supported on this platform (it's
> due to the #ifdef, since COPY_FILE_RANGE is available)
> --copy-file-range: aborted, as it was taking too long; I was
> expecting it to accelerate, but it did not... obviously this is the
> transparent fallback when checksums need to be calculated...
> --manifest-checksums=NONE --copy-file-range: BUG, it keeps appending
> to just one file, e.g. outtest/base/5/16427.29 grew to 200GB+ ?? and it
> ended up with ENOSPC [more on this later]

That's really strange.

> --manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 27m23.887s
> --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
> loop-fix: 5m1.986s but it creates corruption as it stands
>

Thanks. I plan to do more similar tests, once my machines get done with
some other stuff.

> Issues:
>
> 1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
> complains about win32/mingw:
>
> [15:47:27.184] In file included from copy_file.c:22:
> [15:47:27.184] copy_file.c: In function ‘copy_file’:
> [15:47:27.184] ../../../src/include/common/logging.h:134:6: error:
> this statement may fall through [-Werror=implicit-fallthrough=]
> [15:47:27.184] 134 | if (unlikely(__pg_log_level <= PG_LOG_DEBUG)) \
> [15:47:27.184] | ^
> [15:47:27.184] copy_file.c:96:5: note: in expansion of macro ‘pg_log_debug’
> [15:47:27.184] 96 | pg_log_debug("would copy \"%s\" to \"%s\"
> (copy_file_range)",
> [15:47:27.184] | ^~~~~~~~~~~~
> [15:47:27.184] copy_file.c:99:4: note: here
> [15:47:27.184] 99 | case COPY_MODE_COPYFILE:
> [15:47:27.184] | ^~~~
> [15:47:27.184] cc1: all warnings being treated as errors
>

Yup, missing break.
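
The fallthrough is in the switch in copy_file() - roughly this shape
(paraphrasing from memory, the helper name below is made up):

    switch (copy_method)
    {
        case COPY_MODE_COPY_FILE_RANGE:
            if (dry_run)
                pg_log_debug("would copy \"%s\" to \"%s\" (copy_file_range)",
                             src, dst);
            else
                copy_file_by_range(src, dst);   /* hypothetical helper name */
            break;      /* <- this break was missing, hence the warning */

        case COPY_MODE_COPYFILE:
            ...
    }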

> 2. I do not know what the consensus is between --clone and
> --copy-file-range, but if we have #ifdef FICLONE clone_works() #elif
> HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also
> apply the same logic to the --help output, so that --clone is not
> visible there (for consistency?). Also, the "error: file cloning not
> supported on this platform" is technically incorrect; Linux does
> support both ioctl(FICLONE) and copy_file_range(), the #ifdef just
> picks one over the other (so technically it is "available").
> Nitpicking, I know.
>

That's a good question, I'm not sure. But whatever we do, we should do
the same thing in pg_upgrade. Maybe there's some sort of precedent?
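
If we go the route Jakub describes, keeping --help consistent would just
mean guarding the lines with the same macros, something like this (sketch
only, the help wording here is made up):

    #ifdef FICLONE
        printf(_("      --clone                clone (reflink) files instead of copying\n"));
    #endif
    #ifdef HAVE_COPY_FILE_RANGE
        printf(_("      --copy-file-range      use copy_file_range() to copy files\n"));
    #endif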

> 3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned
> ENOSPC spiral-of-death bug symptoms look like this:
>
> strace:
> copy_file_range(8, [697671680], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697679872], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697688064], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697696256], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697704448], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697712640], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697720832], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697729024], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697737216], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697745408], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697753600], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697761792], 9, NULL, 8192, 0) = 8192
> copy_file_range(8, [697769984], 9, NULL, 8192, 0) = 8192
>
> Notice that dest_off_t (poutoff) is NULL.
>
> (gdb) where
> #0 0x00007f2cd56f6733 in copy_file_range (infd=8,
> pinoff=pinoff(at)entry=0x7f2cd53f54e8, outfd=outfd(at)entry=9,
> poutoff=poutoff(at)entry=0x0,
> length=length(at)entry=8192, flags=flags(at)entry=0) at
> ../sysdeps/unix/sysv/linux/copy_file_range.c:28
> #1 0x00005555ecd077f4 in write_reconstructed_file
> (copy_mode=COPY_MODE_COPY_FILE_RANGE, dry_run=false, debug=true,
> checksum_ctx=0x7ffc4cdb7700,
> offsetmap=<optimized out>, sourcemap=0x7f2cd54f6010,
> block_length=<optimized out>, output_filename=0x7ffc4cdba910
> "outtest/base/5/16427.29",
> input_filename=0x7ffc4cdba510
> "incr.2/base/5/INCREMENTAL.16427.29") at reconstruct.c:648
> #2 reconstruct_from_incremental_file
> (input_filename=input_filename(at)entry=0x7ffc4cdba510
> "incr.2/base/5/INCREMENTAL.16427.29",
> output_filename=output_filename(at)entry=0x7ffc4cdba910
> "outtest/base/5/16427.29",
> relative_path=relative_path(at)entry=0x7ffc4cdbc670 "base/5",
> bare_file_name=bare_file_name(at)entry=0x5555ee2056ef "16427.29",
> n_prior_backups=n_prior_backups(at)entry=2,
> prior_backup_dirs=prior_backup_dirs(at)entry=0x7ffc4cdbf248,
> manifests=0x5555ee137a10, manifest_path=0x7ffc4cdbad10
> "base/5/16427.29",
> checksum_type=CHECKSUM_TYPE_NONE, checksum_length=0x7ffc4cdb9864,
> checksum_payload=0x7ffc4cdb9868, debug=true, dry_run=false,
> copy_method=COPY_MODE_COPY_FILE_RANGE) at reconstruct.c:327
>
> .. it's a spiral of death until ENOSPC. Reverting the
> v20240323-2-0002-write_reconstructed_file.patch helps. The problem
> lies in that do { ... } while (wb > 0) infinite loop (?) along with the
> NULL destination off_t. This seems to solve it (?):
>
> -        do
> -        {
> -            wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL, BLCKSZ, 0);
> +        //do
> +        //{
> +            wb = copy_file_range(s->fd, &offsetmap[i], wfd, &offsetmap[i], BLCKSZ, 0);
>              if (wb < 0)
>                  pg_fatal("error while copying file range from \"%s\" to \"%s\": %m",
>                           input_filename, output_filename);
> -        } while (wb > 0);
> +        //} while (wb > 0);
>  #else
>
> ...so that way I've got it down to 5mins.
>

Yeah, that retry logic is wrong. I ended up copying the check from the
"regular" copy branch, which simply bails out if copy_file_range returns
anything but the expected 8192.
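
So the loop is gone and each block copy now looks roughly like this (sketch
only, the exact error messages in the patch may differ):

    wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL, BLCKSZ, 0);
    if (wb < 0)
        pg_fatal("error while copying file range from \"%s\" to \"%s\": %m",
                 input_filename, output_filename);
    else if (wb != BLCKSZ)
        pg_fatal("unexpected short copy from \"%s\" to \"%s\" (%d of %d bytes)",
                 input_filename, output_filename, (int) wb, BLCKSZ);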

I wonder if this should deal with partial writes, though. I mean,
copy_file_range() is allowed to copy only some of the bytes - I don't know
how often / in what situations that happens, though ... And if we want
to handle that for copy_file_range(), pwrite() needs the same treatment.
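
If we did decide to handle short copies, it'd have to be a loop along these
lines (just a sketch - the kernel advances &offsetmap[i] on each call, and
the output side uses the file descriptor's own offset since poutoff is
NULL), and the pwrite branch would need the same dance:

    ssize_t     wb;
    size_t      remaining = BLCKSZ;

    while (remaining > 0)
    {
        wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL, remaining, 0);

        if (wb < 0)
            pg_fatal("error while copying file range from \"%s\" to \"%s\": %m",
                     input_filename, output_filename);
        if (wb == 0)
            pg_fatal("unexpected end of file in \"%s\"", input_filename);

        remaining -= wb;    /* retry the rest of the block */
    }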

> 3. .. but on startup I've got this after trying a psql login: invalid
> page in block 0 of relation base/5/1259. I've again reverted the
> v20240323-2-0002 to see if that helped for the next round of
> pg_combinebackup --manifest-checksums=NONE --copy-file-range, and after
> 32 mins of waiting it did help indeed: I was able to log in and select
> counts worked and matched the data properly. I've reapplied the
> v20240323-2-0002 (with my fix to prevent that endless loop) and the
> issue was there again(!). Probably it's related to the destination
> offset. I couldn't find more time to look at it today and the setup
> was big (100GB on a slow device), so just letting you know as fast as
> possible.
>

Can you see if you can still reproduce this with the attached version?

> 4. More efficiency is on the table (an optional patch note; just
> for completeness; I don't think we have time for that?): even if
> v20240323-2-0002 worked, the problem is that it would be issuing a
> syscall for every 8kB. We seem to be performing lots of per-8kB
> syscalls, which hinders performance (both in copy_file_range and in
> normal copy):
>
> pread64(8, ""..., 8192, 369115136) = 8192 // 369115136 + 8192 =
> 369123328 (matches next pread offset)
> write(9, ""..., 8192) = 8192
> pread64(8, ""..., 8192, 369123328) = 8192 // 369123328 + 8192 = 369131520
> write(9, ""..., 8192) = 8192
> pread64(8, ""..., 8192, 369131520) = 8192 // and so on
> write(9, ""..., 8192) = 8192
>
> Apparently there's no merging of adjacent I/Os, so pg_combinebackup
> wastes lots of time issuing small syscalls when it could, let's say,
> do a single pread/write (or even copy_file_range()) instead. I think
> it was not evident in my earlier testing (200GB; 39min vs ~40s) as I
> had much smaller modifications in my incremental (think of 99%
> static data).
>

Yes, I've been thinking about exactly this optimization, but I think
we're way past proposing this for PG17. The changes it would require
in reconstruct_from_incremental_file are way too significant. It has to
wait for PG18 ;-)
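
Just to sketch the idea for later (field names and the helper below are
made up, reconstruct.c is not structured like this today): walk the block
map and merge runs of blocks that are adjacent in the same source file,
then issue one pread/pwrite or copy_file_range per run instead of per
block.

    unsigned    i = 0;

    while (i < block_length)
    {
        unsigned    j = i + 1;
        off_t       start = offsetmap[i];
        size_t      len = BLCKSZ;

        /* grow the run while the next block is adjacent in the same file
         * (ignoring NULL source entries / zero-filled blocks for brevity) */
        while (j < block_length &&
               sourcemap[j] == sourcemap[i] &&
               offsetmap[j] == start + (off_t) len)
        {
            len += BLCKSZ;
            j++;
        }

        /* one syscall for the whole run (hypothetical helper) */
        copy_block_run(sourcemap[i], start, wfd, len);

        i = j;
    }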

I do think there's more on the table, as mentioned by Thomas a couple of
days ago - maybe we shouldn't approach clone/copy_file_range merely as
an optimization to save time; it might be entirely reasonable to do this
simply to allow the filesystem to do its CoW magic and save space (even
if we need to read the data and recalculate the checksum, which
currently disables these copy methods).

> 5. I think we should change the subject with the new patch revision, so
> that this functionality for incremental backups is not buried in
> the pg_upgrade thread ;)
>

OK.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
v20240326-0001-pg_combinebackup-allow-using-clone-copy_fi.patch text/x-patch 13.7 KB
v20240326-0002-write_reconstructed_file.patch text/x-patch 2.4 KB
