Re: pg_combinebackup --copy-file-range

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_combinebackup --copy-file-range
Date: 2024-03-31 04:46:10
Message-ID: CA+hUKGJw-+S+BaON0yoS10iUC1mcnNWs7Wiaugxfd4Vy8d8HMw@mail.gmail.com
Lists: pgsql-hackers

On Sun, Mar 31, 2024 at 5:33 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> I'm on 2.2.2 (on Linux). But there's something wrong, because the
> pg_combinebackup that took ~150s on xfs/btrfs, takes ~900s on ZFS.
>
> I'm not sure it's a ZFS config issue, though, because it's not CPU or
> I/O bound, and I see this on both machines. And some simple dd tests
> show the zpool can do 10x the throughput. Could this be due to the file
> header / pool alignment?

Could a ZFS recordsize > 8kB be making it worse, since you repeatedly
touch the same 128kB record as you copy_file_range() 16 x 8kB blocks
into it? (I'm guessing you might be using the default recordsize?)
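
To spell out the arithmetic behind that guess, here is a rough sketch. It assumes the default 128kB recordsize and 8kB blocks, and the function names are made up for illustration, not taken from PostgreSQL or ZFS:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values: 8kB PostgreSQL blocks, 128kB (default) ZFS recordsize. */
#define BLOCK_SIZE   ((size_t) 8192)
#define RECORD_SIZE  ((size_t) 131072)

/* How many block-sized copies land in a single record (hypothetical helper). */
size_t
blocks_per_record(size_t record_size, size_t block_size)
{
    assert(block_size > 0 && record_size % block_size == 0);
    return record_size / block_size;
}

/* Index of the record containing a given byte offset (hypothetical helper). */
size_t
record_index(size_t offset, size_t record_size)
{
    assert(record_size > 0);
    return offset / record_size;
}
```

With those numbers, blocks 0 through 15 of a file all map to record 0, so a block-at-a-time copy would visit the same record 16 times before moving on to the next one.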

> I admit I'm not very familiar with the format, but you're probably right
> there's a header, and header_length does not seem to consider alignment.
> make_incremental_rfile simply does this:
>
>     /* Remember length of header. */
>     rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
>         sizeof(rf->truncation_block_length) +
>         sizeof(BlockNumber) * rf->num_blocks;
>
> and sendFile() does the same thing when creating incremental basebackup.
> I guess it wouldn't be too difficult to make sure to align this to
> BLCKSZ or something like this. I wonder if the file format is documented
> somewhere ... It'd certainly be nicer to tweak before v18, if necessary.
>
> Anyway, is that really a problem? I mean, in my tests the CoW stuff
> seemed to work quite fine - at least on the XFS/BTRFS. Although, maybe
> that's why it took longer on XFS ...

Yeah, I'm not sure. I assume it did more allocating and copying because
of that. It doesn't really matter: it would be fine if a first version
weren't as good as possible, and fine if we tune the format later once
we know more, i.e. leaving improvements on the table for now. I just
wanted to share the observation. I wouldn't be surprised if the
block-at-a-time coding makes it slower and maybe makes the on-disk data
structures worse, but I dunno, I'm just guessing.
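
For what it's worth, padding the header out to a block boundary could look roughly like the sketch below. This is not a proposed patch: the helper names are hypothetical, the 32-bit field widths and the 8kB BLCKSZ are assumptions, and only the unpadded computation follows the code quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192             /* assumed default block size */

typedef uint32_t BlockNumber;   /* stand-in for PostgreSQL's typedef */

/* Unpadded header length, mirroring the quoted computation (field widths assumed). */
size_t
raw_header_length(uint32_t num_blocks)
{
    return sizeof(uint32_t)     /* magic */
        + sizeof(uint32_t)      /* num_blocks */
        + sizeof(uint32_t)      /* truncation_block_length */
        + sizeof(BlockNumber) * (size_t) num_blocks;
}

/* Round len up to the next multiple of alignment. */
size_t
round_up(size_t len, size_t alignment)
{
    assert(alignment > 0);
    return (len + alignment - 1) / alignment * alignment;
}

/* Header length padded so the block data that follows starts BLCKSZ-aligned. */
size_t
padded_header_length(uint32_t num_blocks)
{
    return round_up(raw_header_length(num_blocks), BLCKSZ);
}
```

Under these assumptions, a file carrying 100 changed blocks has a 412-byte header today, which would grow to 8192 bytes with padding. Whether that extra space is worth it is exactly the kind of thing that can be decided later.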

It would also be interesting, though not required right now, to figure
out how to tune ZFS well for this purpose...
