Re: pg_upgrade --copy-file-range

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_upgrade --copy-file-range
Date: 2024-03-23 17:57:49
Message-ID: 4a039af3-3976-4716-90a9-2ee2839f1df3@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/23/24 14:47, Tomas Vondra wrote:
> On 3/23/24 13:38, Robert Haas wrote:
>> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>>> Hmm, this discussion seems to assume that we only use
>>> copy_file_range() to copy/clone whole segment files, right? That's
>>> great and may even get most of the available benefit given typical
>>> databases with many segments of old data that never changes, but... I
>>> think copy_write_range() allows us to go further than the other
>>> whole-file clone techniques: we can stitch together parts of an old
>>> backup segment file and an incremental backup to create a new file.
>>> If you're interested in minimising disk use while also removing
>>> dependencies on the preceding chain of backups, then it might make
>>> sense to do that even if you *also* have to read the data to compute
>>> the checksums, I think? That's why I mentioned it: if
>>> copy_file_range() (ie sub-file-level block sharing) is a solution in
>>> search of a problem, has the world ever seen a better problem than
>>> pg_combinebackup?
>>
>> That makes sense; it's just a different part of the code than I
>> thought we were talking about.
>>
>
> Yeah, that's in write_reconstructed_file() and the patch does not touch
> that at all. I agree it would be nice to use copy_file_range() in this
> part too, and it doesn't seem it'd be that hard to do, I think.
>
> It seems we'd just need a "fork" that either calls pread/pwrite or
> copy_file_range, depending on checksums and what was requested.
>

Here's a patch to use copy_file_range in write_reconstructed_file too,
when requested/possible. One thing that I'm not sure about is whether to
do pg_fatal() if --copy-file-range but the platform does not support it.
This is more like what pg_upgrade does, but maybe we should just ignore
what the user requested and fallback to the regular copy (a bit like
when having to do a checksum for some files). Or maybe the check should
just happen earlier ...

I've been thinking about what Thomas wrote - that maybe it'd be good to
do copy_file_range() even when calculating the checksum, and I think he
may be right. But the current patch does not do that, and while it
doesn't seem very difficult to do (at least when reconstructing the file
from incremental backups) I don't have a very good intuition whether
it'd be a win or not in typical cases.

I have a naive question about the checksumming - if we used a
merkle-tree-like scheme, i.e. hashing blocks and not hashes of blocks,
wouldn't that allow calculating the hashes even without having to read
the blocks, making copy_file_range more efficient? Sure, it's more
complex, but a well known scheme. (OK, now I realized it'd mean we can't
use tools like sha224sum to hash the files and compare to manifest. I
guess that's why we don't do this ...)

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
v20240323-2-0002-write_reconstructed_file.patch text/x-patch 2.3 KB
v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch text/x-patch 13.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dmitry Dolgov 2024-03-23 18:20:26 Re: pg_stat_statements and "IN" conditions
Previous Message Paul Jungwirth 2024-03-23 17:42:47 Re: SQL:2011 application time