Re: pg_combinebackup --copy-file-range

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_combinebackup --copy-file-range
Date: 2024-03-27 11:05:24
Message-ID: CAKZiRmwjYhu=QNpjG+zz6Xmm=sj9fgiFviDMJXdQDAFqKe4wCw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 26, 2024 at 7:03 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
[..]
>
> That's really strange.

Hi Tomas - it looks like it's fixed now :)

> > --manifest-checksums=NONE --copy-file-range without v20240323-2-0002: 27m23.887s
> > --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
> > loop-fix: 5m1.986s but it creates corruption as it stands
> >
>
> Thanks. I plan to do more similar tests, once my machines get done with
> some other stuff.

Please do so as I do not trust my fingers :-)

> > Issues:
> >
> > 1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
> > complains about win32/mingw:
> >
[..]
> >
>
> Yup, missing break.

Now it's
https://cirrus-ci.com/task/4997185324974080?logs=headers_headerscheck#L10
reproducible with "make -s headerscheck EXTRAFLAGS='-fmax-errors=10'":

/tmp/postgres/src/bin/pg_combinebackup/reconstruct.h:34:91: error:
unknown type name ‘CopyMode’ / CopyMode copy_mode);

To me it looks like reconstruct.h needs to include the definition of
CopyMode, which lives in copy_file.h (i.e. add #include "copy_file.h").
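
A minimal sketch of the fix I have in mind (just the missing include;
the prototype itself stays as-is):

    /* src/bin/pg_combinebackup/reconstruct.h */
    #include "copy_file.h"    /* defines the CopyMode type used below */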

> > 2. I do not know what's the consensus between --clone and
> > --copy-file-range , but if we have #ifdef FICLONE clone_works() #elif
> > HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also
> > apply the same logic to the --help so that --clone is not visible
> > there (for consistency?). Also the "error: file cloning not supported
> > on this platform " is technically incorrect, Linux does support
> > ioctl(FICLONE) and copy_file_range(), but we are just not choosing one
> > over another (so technically it is "available"). Nitpicking I know.
> >
>
> That's a good question, I'm not sure. But whatever we do, we should do
> the same thing in pg_upgrade. Maybe there's some sort of precedent?

Sigh, you are right... It's consistency hell.
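
Just to illustrate what I meant by applying the same logic to --help
(a hypothetical sketch, the option descriptions are mine):

    #if defined(FICLONE)
        printf(_("      --clone              clone (reflink) instead of copying files\n"));
    #elif defined(HAVE_COPY_FILE_RANGE)
        printf(_("      --copy-file-range    copy files using copy_file_range()\n"));
    #endif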

> > 3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned
> > ENOSPC spiral-of-death bug symptoms are like that:
[..]
>
> Yeah, that retry logic is wrong. I ended up copying the check from the
> "regular" copy branch, which simply bails out if copy_file_range returns
> anything but the expected 8192.
>
> I wonder if this should deal with partial writes, though. I mean,
> copy_file_range() is allowed to copy only some of the bytes - I don't know
> how often / in what situations that happens, though ... And if we want
> to handle that for copy_file_range(), pwrite() needs the same treatment.

Maybe this helps?
https://github.com/coreutils/coreutils/blob/606f54d157c3d9d558bdbe41da8d108993d86aeb/src/copy.c#L1427
- it's harder than I anticipated (we can ignore the sparse-file logic
there, I think).
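
Something along these lines is what I had in mind for the partial-copy
case (a rough sketch with made-up variable names, not the patch's code):

    size_t      remaining = BLCKSZ;

    while (remaining > 0)
    {
        ssize_t     nwritten;

        nwritten = copy_file_range(src_fd, &src_off, dest_fd, &dest_off,
                                   remaining, 0);
        if (nwritten < 0 && errno == EINTR)
            continue;           /* interrupted before copying anything, retry */
        if (nwritten <= 0)
            pg_fatal("error while copying block: %m");
        remaining -= nwritten;  /* the kernel already advanced both offsets */
    }

(and presumably the pwrite() path would need an equivalent loop.)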

> > 3. .. but on startup I've got this after trying psql login: invalid
> > page in block 0 of relation base/5/1259 .
[..]
>
> Can you see if you can still reproduce this with the attached version?

Looks like it's fixed now and it works great (~3min, multiple times)!

BTW: I also tried it over NFSv4 over loopback with XFS, since
copy_file_range() probably does server-side offload there, but somehow
it was so slow that it's close to unusable (~9GB out of 104GB
reconstructed after 45 mins) - maybe it's down to the NFS mount opts;
I don't think we should worry too much. I suspect it's related to the
missing optimization mentioned below, if that matters. I think it's too
early to warn users about NFS (I've only spent 10 mins on it), but on
the other hand people might complain it's broken...

> > Apparently there's no merging of adjacent I/Os, so pg_combinebackup
> > wastes lots of time issuing many small syscalls where it could, say,
> > do a single pread/pwrite (or even copy_file_range()). I think this
> > was not evident in my earlier testing (200GB; 39min vs ~40s) as I
> > had much smaller modifications in my incremental (think 99% static
> > data).
> >
>
> Yes, I've been thinking about exactly this optimization, but I think
> we're way past proposing this for PG17. The changes that would be
> required in reconstruct_from_incremental_file are way too significant. Has to
> wait for PG18 ;-)

Sure thing!
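
FWIW, the kind of adjacent-I/O merging I was thinking about is roughly
this shape (a hypothetical sketch; source[], offset[] and
copy_block_range() are made-up names, nothing from the patch):

    unsigned    run_start = 0;

    for (unsigned i = 1; i <= n_blocks; i++)
    {
        /*
         * Flush the current run once the next block is not contiguous or
         * comes from a different source file.
         */
        if (i == n_blocks ||
            source[i] != source[run_start] ||
            offset[i] != offset[i - 1] + BLCKSZ)
        {
            size_t      len = (size_t) (i - run_start) * BLCKSZ;

            /* one syscall for the whole run instead of one per block */
            copy_block_range(source[run_start], offset[run_start], len);
            run_start = i;
        }
    }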

> I do think there's more on the table, as mentioned by Thomas a couple
> days ago - maybe we shouldn't approach clone/copy_file_range merely as
> an optimization to save time, it might be entirely reasonable to do this
> simply to allow the filesystem to do CoW magic and save space (even if
> we need to read the data and recalculate the checksum, which now
> disables these copy methods).

Sure! I think time will still be the priority though, as
pg_combinebackup duration impacts RTO, while disk space is relatively
cheap.

One could argue that reconstructing 50TB will be a challenge either
way. My tests indicate the space saving is already happening with the
0002 patch - the 100GB DB / full backup stats look like this (so I
think we are good when using CoW, not so without it) -- or did I
misunderstand something?:

root(at)jw-test-1:/backups# du -sm /backups/
214612 /backups/
root(at)jw-test-1:/backups# du -sm *
106831 full
2823 incr.1
165 incr.2
104794 outtest
root(at)jw-test-1:/backups# df -h . # note this double confirms that just
114GB is used (XFS), great!
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 500G 114G 387G 23% /backups
root(at)jw-test-1:/backups# # https://github.com/pwaller/sharedextents
root(at)jw-test-1:/backups# ./sharedextents-linux-amd64
full/base/5/16427.68 outtest/base/5/16427.68
1056915456 / 1073741824 bytes (98.43%) # extents reuse

Now I was wondering a little whether the large number of XFS extents
won't hurt read performance (they were probably created due to the many
independent copy_file_range() calls):

root(at)jw-test-1:/backups# filefrag full/base/5/16427.68
full/base/5/16427.68: 1 extent found
root(at)jw-test-1:/backups# filefrag outtest/base/5/16427.68
outtest/base/5/16427.68: 3979 extents found

However, at first look, sequential reads of such a CoW file are still
fine (I'm assuming that, after reconstruction, such a backup would be
copied from this backup server back to the proper DB server):

root(at)jw-test-1:/backups# echo 3 > /proc/sys/vm/drop_caches
root(at)jw-test-1:/backups# time cat outtest/base/5/16427.68 > /dev/null
real 0m4.286s
root(at)jw-test-1:/backups# echo 3 > /proc/sys/vm/drop_caches
root(at)jw-test-1:/backups# time cat full/base/5/16427.68 > /dev/null
real 0m4.325s

Now, Thomas wrote there: "then it might make sense to do that even if
you *also* have to read the data to compute the checksums, I think?"
... which sounds like: read(), checksum(), and still do
copy_file_range() instead of pwrite()? PG 18? :D
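
If I read that right, per block it could look roughly like this (just a
sketch; src_fd, dest_fd, blkno and checksum_ctx are made-up names):

    char        buf[BLCKSZ];
    off_t       off = (off_t) blkno * BLCKSZ;
    off_t       in_off = off,
                out_off = off;

    /* read the data only to feed the checksum ... */
    if (pread(src_fd, buf, BLCKSZ, off) != BLCKSZ)
        pg_fatal("could not read block %u", blkno);
    if (pg_checksum_update(&checksum_ctx, (uint8 *) buf, BLCKSZ) < 0)
        pg_fatal("could not update checksum");

    /* ... but still let the filesystem share the extent via CoW */
    if (copy_file_range(src_fd, &in_off, dest_fd, &out_off, BLCKSZ, 0) != BLCKSZ)
        pg_fatal("could not copy block %u", blkno);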

-J.
