Re: pg_upgrade --copy-file-range

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_upgrade --copy-file-range
Date: 2024-03-26 14:09:54
Message-ID: CAKZiRmwZXcKtW4uv__H3NyREtqW6o8MZ6k4YX6M=V0T18CcbkA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 23, 2024 at 6:57 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:

> On 3/23/24 14:47, Tomas Vondra wrote:
> > On 3/23/24 13:38, Robert Haas wrote:
> >> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
[..]
> > Yeah, that's in write_reconstructed_file() and the patch does not touch
> > that at all. I agree it would be nice to use copy_file_range() in this
> > part too, and it doesn't seem it'd be that hard to do, I think.
> >
> > It seems we'd just need a "fork" that either calls pread/pwrite or
> > copy_file_range, depending on checksums and what was requested.
> >
>
> Here's a patch to use copy_file_range in write_reconstructed_file too,
> when requested/possible. One thing that I'm not sure about is whether to
> do pg_fatal() if --copy-file-range but the platform does not support it.
[..]

Hi Tomas, so I gave the two patches below a go today:
- v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch
- v20240323-2-0002-write_reconstructed_file.patch

My assessment:

v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch -
looks more or less good to go
v20240323-2-0002-write_reconstructed_file.patch - needs work; without
it, the good effects of clone/copy_file_range() are unlikely to
materialize

Given Debian 12, a ~100GB DB (pgbench -i -s 7000, plus some additional
tables with GiST and GIN indexes, just to see more WAL record types),
and with backup sizes in MB like this:

106831 full
2823 incr.1 # captured after some time with pgbench -R 100
165 incr.2 # captured after some time with pgbench -R 100

Test cmd: rm -rf outtest; sync; sync ; sync; echo 3 | sudo tee
/proc/sys/vm/drop_caches ; time /usr/pgsql17/bin/pg_combinebackup -o
outtest full incr.1 incr.2

Test results for the various copy modes on a small, I/O-constrained
XFS device:

normal copy: 31m47.407s
--clone copy: error: file cloning not supported on this platform
(that's due to the #ifdef: COPY_FILE_RANGE is available, so the clone
path is not compiled in)
--copy-file-range: aborted, as it was taking too long; I was expecting
it to accelerate, but it did not. Obviously this is the transparent
fallback to normal copying when checksums have to be calculated...
--manifest-checksums=NONE --copy-file-range: BUG, it kept appending to
just one file, e.g. outtest/base/5/16427.29 grew past 200GB, and it
ended up with ENOSPC [more on this later]
--manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 27m23.887s
--manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
loop-fix: 5m1.986s, but it creates corruption as it stands

Issues:

1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
complains about win32/mingw:

[15:47:27.184] In file included from copy_file.c:22:
[15:47:27.184] copy_file.c: In function ‘copy_file’:
[15:47:27.184] ../../../src/include/common/logging.h:134:6: error:
this statement may fall through [-Werror=implicit-fallthrough=]
[15:47:27.184] 134 | if (unlikely(__pg_log_level <= PG_LOG_DEBUG)) \
[15:47:27.184] | ^
[15:47:27.184] copy_file.c:96:5: note: in expansion of macro ‘pg_log_debug’
[15:47:27.184] 96 | pg_log_debug("would copy \"%s\" to \"%s\" (copy_file_range)",
[15:47:27.184] | ^~~~~~~~~~~~
[15:47:27.184] copy_file.c:99:4: note: here
[15:47:27.184] 99 | case COPY_MODE_COPYFILE:
[15:47:27.184] | ^~~~
[15:47:27.184] cc1: all warnings being treated as errors
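
By the way, the warning itself is valid: the dry-run branch logs and
then control falls through into the next case label. A minimal,
self-contained illustration of the pattern and of the obvious fix (an
explicit break) -- this is my sketch, not the patch's actual code:

/*
 * fallthrough.c -- why mingw's -Wimplicit-fallthrough fires in a
 * dry-run switch and how an explicit break silences it. Names mimic
 * copy_file.c, but this is illustrative only.
 */
#include <stdio.h>

typedef enum
{
	COPY_MODE_COPY_FILE_RANGE,
	COPY_MODE_COPYFILE
} CopyMode;

static void
copy_file(const char *src, const char *dst, CopyMode mode, int dry_run)
{
	switch (mode)
	{
		case COPY_MODE_COPY_FILE_RANGE:
			if (dry_run)
			{
				printf("would copy \"%s\" to \"%s\" (copy_file_range)\n",
					   src, dst);
				break;	/* without this, control falls through into
						 * COPY_MODE_COPYFILE and the compiler complains */
			}
			/* ... real copy_file_range() path would go here ... */
			break;
		case COPY_MODE_COPYFILE:
			/* ... fallback copy path ... */
			break;
	}
}

int
main(void)
{
	copy_file("a", "b", COPY_MODE_COPY_FILE_RANGE, 1);
	return 0;
}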

2. I do not know what the consensus is between --clone and
--copy-file-range, but if we have #ifdef FICLONE clone_works() #elif
HAVE_COPY_FILE_RANGE copy_file_range_only_works(), then we should also
apply the same logic to --help, so that --clone is not visible there
(for consistency?). Also, the message "error: file cloning not
supported on this platform" is technically incorrect: Linux does
support ioctl(FICLONE) and copy_file_range(), we are just choosing one
over the other at build time (so technically it is "available").
Nitpicking, I know.
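
If the selection really is an #ifdef/#elif chain, --help could simply
mirror it. A tiny self-contained sketch of what I mean (assumption:
FICLONE / HAVE_COPY_FILE_RANGE come from the build, as in the real
tree; this is not the patch's usage() code):

#include <stdio.h>

static void
usage(void)
{
	printf("Usage:\n  pg_combinebackup [OPTION]... DIRECTORY...\n");
	/* advertise only the method the #ifdef chain would actually pick */
#if defined(FICLONE)
	printf("      --clone              clone (reflink) files instead of copying\n");
#elif defined(HAVE_COPY_FILE_RANGE)
	printf("      --copy-file-range    copy files using copy_file_range()\n");
#endif
}

int
main(void)
{
	usage();
	return 0;
}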

3. [v20240323-2-0002-write_reconstructed_file.patch]: The symptoms of
the mentioned ENOSPC spiral-of-death bug look like this:

strace:
copy_file_range(8, [697671680], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697679872], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697688064], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697696256], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697704448], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697712640], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697720832], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697729024], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697737216], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697745408], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697753600], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697761792], 9, NULL, 8192, 0) = 8192
copy_file_range(8, [697769984], 9, NULL, 8192, 0) = 8192

Notice that the destination offset (poutoff) is NULL, so every call
writes at wfd's current file position and advances it.

(gdb) where
#0 0x00007f2cd56f6733 in copy_file_range (infd=8,
pinoff=pinoff(at)entry=0x7f2cd53f54e8, outfd=outfd(at)entry=9,
poutoff=poutoff(at)entry=0x0,
length=length(at)entry=8192, flags=flags(at)entry=0) at
../sysdeps/unix/sysv/linux/copy_file_range.c:28
#1 0x00005555ecd077f4 in write_reconstructed_file
(copy_mode=COPY_MODE_COPY_FILE_RANGE, dry_run=false, debug=true,
checksum_ctx=0x7ffc4cdb7700,
offsetmap=<optimized out>, sourcemap=0x7f2cd54f6010,
block_length=<optimized out>, output_filename=0x7ffc4cdba910
"outtest/base/5/16427.29",
input_filename=0x7ffc4cdba510
"incr.2/base/5/INCREMENTAL.16427.29") at reconstruct.c:648
#2 reconstruct_from_incremental_file
(input_filename=input_filename(at)entry=0x7ffc4cdba510
"incr.2/base/5/INCREMENTAL.16427.29",
output_filename=output_filename(at)entry=0x7ffc4cdba910
"outtest/base/5/16427.29",
relative_path=relative_path(at)entry=0x7ffc4cdbc670 "base/5",
bare_file_name=bare_file_name(at)entry=0x5555ee2056ef "16427.29",
n_prior_backups=n_prior_backups(at)entry=2,
prior_backup_dirs=prior_backup_dirs(at)entry=0x7ffc4cdbf248,
manifests=0x5555ee137a10, manifest_path=0x7ffc4cdbad10
"base/5/16427.29",
checksum_type=CHECKSUM_TYPE_NONE, checksum_length=0x7ffc4cdb9864,
checksum_payload=0x7ffc4cdb9868, debug=true, dry_run=false,
copy_method=COPY_MODE_COPY_FILE_RANGE) at reconstruct.c:327

.. it's a spiral of death till ENOSPC. Reverting
v20240323-2-0002-write_reconstructed_file.patch helps. The problem
lies in that do { ... } while (wb > 0) loop combined with the NULL
destination off_t: wb is always 8192, so the loop keeps copying
successive 8kB chunks (pinoff advances on every call) way past the
block, until input EOF. This seems to solve it (?):

-            do
-            {
-                wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL, BLCKSZ, 0);
+            //do
+            //{
+                wb = copy_file_range(s->fd, &offsetmap[i], wfd, &offsetmap[i], BLCKSZ, 0);
                 if (wb < 0)
                     pg_fatal("error while copying file range from \"%s\" to \"%s\": %m",
                              input_filename, output_filename);
-            } while (wb > 0);
+            //} while (wb > 0);
 #else

...so that way I've got it down to 5mins.
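
For the record, if the loop is meant to guard against short copies, a
variant that keeps poutoff = NULL (so the output file position keeps
advancing sequentially, preserving block order) and retries only the
remainder of the current block might look like the sketch below.
Untested, and it assumes the surrounding write_reconstructed_file()
context (wb, s->fd, wfd, offsetmap, input_filename, output_filename),
so take it with a grain of salt:

		size_t		remaining = BLCKSZ;

		/* Copy exactly one block: the kernel advances &offsetmap[i]
		 * on each partial copy, and the NULL poutoff makes each write
		 * land at wfd's current (sequential) position. */
		while (remaining > 0)
		{
			wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL,
								 remaining, 0);
			if (wb < 0)
				pg_fatal("error while copying file range from \"%s\" to \"%s\": %m",
						 input_filename, output_filename);
			if (wb == 0)
				pg_fatal("unexpected EOF while copying file range from \"%s\" to \"%s\"",
						 input_filename, output_filename);
			remaining -= wb;
		}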

4. .. but on startup I got this after trying to log in with psql:
invalid page in block 0 of relation base/5/1259. I again reverted
v20240323-2-0002 to see if that helped for the next round of
pg_combinebackup --manifest-checksums=NONE --copy-file-range, and
after 32 minutes of waiting it did help indeed: I was able to log in,
and SELECT counts worked and matched the data properly. I reapplied
v20240323-2-0002 (with my fix to prevent that endless loop) and the
issue was there again(!). Probably it's related to the destination
offset: &offsetmap[i] is the block's offset in the *source* file,
which is not necessarily where the block belongs in the reconstructed
output file. I couldn't find more time to look at it today and the
setup was big (100GB on a slow device), so I'm just letting you know
as fast as possible.

5. More efficiency is on the table (an optional patch idea; just for
completeness; I don't think we have time for that?): even if
v20240323-2-0002 worked, the problem is that it would issue a syscall
for every 8kB. We seem to be performing lots of per-8kB syscalls,
which hinders performance (both with copy_file_range and with normal
copy):

pread64(8, ""..., 8192, 369115136) = 8192 // 369115136 + 8192 =
369123328 (matches next pread offset)
write(9, ""..., 8192) = 8192
pread64(8, ""..., 8192, 369123328) = 8192 // 369123328 + 8192 = 369131520
write(9, ""..., 8192) = 8192
pread64(8, ""..., 8192, 369131520) = 8192 // and so on
write(9, ""..., 8192) = 8192

Apparently there's no merging of adjacent I/Os, so pg_combinebackup
wastes lots of time issuing many small syscalls where it could issue,
say, a single larger pread/write (or even copy_file_range()) per run
of adjacent blocks, as sketched below. I think it was not evident in
my earlier testing (200GB; 39min vs ~40s) because I had much smaller
modifications in my incrementals (think 99% static data).
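
To illustrate the merging idea (a rough sketch only, reusing the names
from write_reconstructed_file() -- sourcemap[i]/offsetmap[i] describe
where output block i comes from -- with a hypothetical copy_run()
helper standing in for the actual pread/write or copy_file_range call,
and ignoring zero-filled blocks for brevity):

	for (unsigned i = 0; i < block_length; i++)
	{
		unsigned	nblocks = 1;

		/* Grow the run while the next block comes from the same
		 * source file and sits at the physically adjacent offset. */
		while (i + nblocks < block_length &&
			   sourcemap[i + nblocks] == sourcemap[i] &&
			   offsetmap[i + nblocks] == offsetmap[i] + nblocks * BLCKSZ)
			nblocks++;

		/* One syscall per run instead of one per 8kB block. */
		copy_run(sourcemap[i], offsetmap[i], (size_t) nblocks * BLCKSZ);

		i += nblocks - 1;
	}

With mostly-unchanged relations the runs should be long, so this would
collapse most of those per-8kB pread/write pairs into a handful of
large calls.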

6. I think we should change the subject line with the next patch
revision, so that this functionality for incremental backups is not
buried in the pg_upgrade thread ;)

-J.
