Re: Two patches to speed up pg_rewind.

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Paul Guo <paulguo(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Paul Guo <guopa(at)vmware(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Two patches to speed up pg_rewind.
Date: 2021-08-20 05:23:07
Lists: pgsql-hackers

On Fri, Aug 20, 2021 at 11:33:33AM +0800, Paul Guo wrote:
> On Tue, Aug 17, 2021 at 3:47 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > + fsync_fname("global/pg_control", false);
> > + fsync_fname("backup_label", false);
> > + if (access("recovery.conf", F_OK) == 0)
> > + fsync_fname("recovery.conf", false);
> > + if (access("", F_OK) == 0)
> > + fsync_fname("", false);
> >
> > This list is wrong on various aspects, no? This would miss custom
> > configuration files, or included files.
> I did not understand this. Can you please clarify? Anyway, let me
> explain: here we fsync these files additionally because pg_rewind
> (possibly) modified these files after rewinding. These files may not
> be handled/logged in the filemap:
> - pg_control's action is FILE_ACTION_NONE
> - backup_label is excluded
> - recovery.conf is not logged in the filemap
> - may be logged, but let's fsync this file for safety.

I am referring to new files copied from the origin cluster to the
target, as pg_rewind copies everything. postgresql.conf could for
example include a foo.conf, which would be ignored here.

Anyway, it seems to me that the copy_file_range() bits of the patch
with the copy optimizations are more interesting than the flush
optimizations, so I would tend to focus on that first.

>> + * Pre Linux 5.3 does not allow cross-fs copy_file_range() call
>> + * (return EXDEV). Some fs do not support copy_file_range() (return
>> + * ENOTSUP). Here we explicitly disable copy_file_range() for the
>> + * two scenarios. For other failures we still allow subsequent
>> + * copy_file_range() try.
>> + */
>> + if (errno == ENOTSUP || errno == EXDEV)
>> + copy_file_range_support = false;
>> Are you sure that it is right to always cancel the support of
>> copy_file_range() after it does not work once? Couldn't it be
>> possible that things work properly depending on the tablespace being
>> worked on by pg_rewind?
> Ideally we should retry when first running into a symlink (e.g.
> tablespace, wal),
> but it seems not easy to do gracefully.

My guess here is that we should just remove this flag, and attempt
copy_file_range() for each file. That brings an extra point that
requires benchmarking, actually.

>> Having the facility for copy_file_range() in pg_rewind is not nice at
>> the end, and we are going to need a run-time check to fallback
>> dynamically to an equivalent implementation on errno={EXDEV,ENOTSUP}.
>> Hmm. What about locating all that in file_utils.c instead, with a
>> brand new routine name (pg_copy_file_range would be one choice)? We
>> still need the ./configure check, except that the conditions to use
>> the fallback implementation is in this routine, aka fallback on EXDEV,
>> ENOTSUP or !HAVE_COPY_FILE_RANGE. The backend may need to solve this
>> problem at some point, but logging and fd handling will likely just
>> locate that in fd.c, so having one routine for the purpose of all
>> frontends looks like a step in the right direction.
> Yes, seems better to make it generic.
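For illustration, a generic routine along those lines could look like
the sketch below. This is not the actual patch: the function name
follows the pg_copy_file_range() suggestion above, while the signature,
the 8kB buffer size, and the use of __linux__ as a stand-in for the
HAVE_COPY_FILE_RANGE configure check are all assumptions. It falls back
per call on EXDEV/ENOTSUP rather than via a global flag:

```c
#define _GNU_SOURCE				/* for copy_file_range() on glibc */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Sketch of a generic pg_copy_file_range() for file_utils.c: try the
 * kernel's copy_file_range() first, and fall back to a plain 8kB
 * read/write loop on EXDEV or ENOTSUP, or when the call is not
 * available at build time.
 */
ssize_t
pg_copy_file_range(int srcfd, int dstfd, size_t len)
{
	size_t		remaining = len;

#ifdef __linux__				/* stand-in for HAVE_COPY_FILE_RANGE */
	while (remaining > 0)
	{
		ssize_t		nbytes = copy_file_range(srcfd, NULL, dstfd, NULL,
											 remaining, 0);

		if (nbytes > 0)
			remaining -= nbytes;
		else if (nbytes == 0)
			return (ssize_t) (len - remaining);	/* source exhausted early */
		else if (errno == EXDEV || errno == ENOTSUP)
			break;				/* fall back below, per call, not globally */
		else
			return -1;			/* hard error: let the caller report errno */
	}
#endif

	/* Fallback: plain buffered copy, 8kB at a time */
	while (remaining > 0)
	{
		char		buf[8192];
		size_t		chunk = remaining < sizeof(buf) ? remaining : sizeof(buf);
		ssize_t		nread = read(srcfd, buf, chunk);
		char	   *p = buf;

		if (nread <= 0)
			return (nread == 0) ? (ssize_t) (len - remaining) : -1;
		while (nread > 0)
		{
			ssize_t		nwritten = write(dstfd, p, nread);

			if (nwritten < 0)
				return -1;
			p += nwritten;
			nread -= nwritten;
			remaining -= nwritten;
		}
	}
	return (ssize_t) len;
}
```

This mirrors the fallback conditions listed above (EXDEV, ENOTSUP,
!HAVE_COPY_FILE_RANGE) while deciding for each call rather than once
for the whole run.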

One disadvantage of having a fallback implementation in file_utils.c,
now that I look closely, is that it would make the progress reporting
less verbose, as write_target_range() is called once for each 8kB
block. So on this point, your approach keeps the code simpler, while
my suggestion makes the logic more complicated.

Anyway, I think that it would be good to do more benchmarking for this
patch first, and the patch you are proposing is enough for that.
There are two scenarios I can think as useful to look at, to emulate
cases where pg_rewind has to copy a set of files:
- A small number of large files (say 3~5 files of 600MB~1GB each).
- Many small files (say 200~ files of 8MB~16MB each).
Those numbers can be tweaked up or down, as long as a difference can
be measured while avoiding noise in runtimes.

I only have at hand now a system with ext4 on a 5.10 kernel, that does
not have any acceleration techniques as far as I know, so I have just
measured that this introduces no regressions. But it would be good to
also see how much performance we'd gain with something that can take
advantage of copy_file_range() in full. One good case would be XFS
with reflink=1 and see how fast we go with the two cases from above
with and without the patch. Everything I have read on the matter
suggests that the copy will be faster, but it would be good to have
numbers to back that up.

copy_file_range_support in the patch is also something I am worried
about, as it could lead to incorrect decisions depending on the order
of the paths processed and on the mount points of the origin and/or
the target. Removing it means that it would be important to measure
the case where we use copy_file_range() but fail all (or some of!)
its calls on EXDEV. That would imply a test comparing the performance
with and without the patch where the origin and target folders are on
different mount points. I looked at some kernel code from 5.2, and
the check looks cheap enough: a lookup at the inodes of the source
and target to verify the mount points involved.

So, what do you think?

By the way, the progress reporting is wrong in this new code path:

+copy_target_range(int srcfd, off_t begin, size_t size)
+ ssize_t copylen;
+ /* update progress report */
+ fetch_done += size;
+ progress_report(false);

If copy_file_range() fails, say because of EXDEV, we would finish by
counting the same size twice, with write_target_range() then
reporting the same amount a second time for each block of 8kB.
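One way to avoid that double counting would be to advance the counter
only by what copy_file_range() actually copied, leaving the fallback
path to account for the bytes it writes itself. In the sketch below,
fetch_done and progress_report() are simplified stand-ins for
pg_rewind's real progress state, and the helper's signature is
hypothetical, not the patch's:

```c
#define _GNU_SOURCE				/* for copy_file_range() on glibc */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

/* Simplified stand-ins for pg_rewind's progress state (assumed names) */
static size_t fetch_done = 0;

static void
progress_report(bool finished)
{
	(void) finished;			/* the real routine prints fetch_done */
}

/*
 * Hypothetical rework of copy_target_range(): count only the bytes
 * copy_file_range() actually moved.  On failure the caller falls back
 * to write_target_range(), which then reports its own 8kB writes
 * exactly once, so nothing is counted twice.
 */
static bool
copy_target_range(int srcfd, int dstfd, off_t begin, size_t size)
{
	while (size > 0)
	{
		ssize_t		copylen = copy_file_range(srcfd, &begin, dstfd, NULL,
											  size, 0);

		if (copylen <= 0)
			return false;		/* fall back, nothing double-counted */

		size -= copylen;
		fetch_done += copylen;	/* only what was actually copied */
		progress_report(false);
	}
	return true;
}
```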
