Two patches to speed up pg_rewind.

From: Paul Guo <guopa(at)vmware(dot)com>
To: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Two patches to speed up pg_rewind.
Date: 2021-01-27 09:18:48
Message-ID: 7C1703E7-F3F3-43FA-86EB-177C671BF33C@vmware.com
Lists: pgsql-hackers


While reading the pg_rewind code I found two things that could speed up
pg_rewind. The patches are attached.

First one: by default pg_rewind fsyncs the whole pgdata directory on the target,
but that is wasteful since usually only some of the files/directories on
the target are modified. The other files on the target should already have been
flushed, since pg_rewind requires a clean shutdown before doing the real work.
This helps when the target postgres instance contains millions of files, which
has been seen in a real environment.
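
To illustrate the idea (this is not the attached patch), here is a minimal
sketch that fsyncs only a list of paths recorded as modified instead of walking
the whole data directory. The helper names fsync_path() and
fsync_modified_paths() are hypothetical, and the sketch assumes the caller has
already collected the modified paths during the rewind.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: fsync a single file or directory by path.
 * Returns 0 on success, -1 on failure.  Note that some platforms refuse
 * fsync() on a directory; real code would have to tolerate that.
 */
static int
fsync_path(const char *path)
{
    int         fd;
    int         rc;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    rc = fsync(fd);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/*
 * Flush only the paths that were recorded as modified, rather than every
 * file under pgdata.  Returns the number of paths that failed to sync.
 */
static int
fsync_modified_paths(const char *const *paths, size_t npaths)
{
    int         failures = 0;

    for (size_t i = 0; i < npaths; i++)
    {
        if (fsync_path(paths[i]) != 0)
            failures++;
    }
    return failures;
}
```

With millions of files on the target, the cost then scales with the number of
touched paths rather than with the size of the data directory.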

There are several things that may need further discussion:

1. PG_FLUSH_DATA_WORKS was introduced as "Define PG_FLUSH_DATA_WORKS if we have
an implementation for pg_flush_data", but now the only code it guards is related
to pre_sync_fname(), so we might want to rename it to something like
HAVE_PRE_SYNC?

2. pre_sync_fname() implementation

The code looks like this:
#if defined(HAVE_SYNC_FILE_RANGE)
(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#endif

I’m a bit suspicious about calling posix_fadvise() with POSIX_FADV_DONTNEED.
I did not check the Linux kernel code, but according to the man page I suspect
that this option makes the kernel tend to evict the related pages from the page
cache, which might not be what we want. This is not a big issue since
sync_file_range() should be available on most widely used Linux distributions.

Also, I’m not sure how much we benefit from the pre-sync code. Note that if the
directory has a lot of files, or the I/O is fast, pre_sync_fname() might slow
the process down instead. The reasons are:

- If there are a lot of files, we may have to read already-synced-and-evicted
  inodes back from disk (by open()-ing the files) after rewinding, since the
  inode cache in the Linux kernel is limited.
- If the I/O is fast and the kernel flushes dirty pages in the background
  quickly, pre_sync_fname() might just waste CPU cycles.

A better solution might be to launch a separate pthread that fsyncs the files
one by one, each as soon as pg_rewind finishes handling it. pg_basebackup could
use the same approach.

Anyway, this is independent of this patch.
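
For discussion only, the background-fsync idea above could be sketched roughly
as follows. Nothing here comes from the patches: SyncQueue, sync_queue_start(),
sync_queue_add() and sync_queue_finish() are hypothetical names, and the sketch
assumes a single worker thread fed through a mutex-protected linked list.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef struct SyncItem
{
    struct SyncItem *next;
    char       *path;
} SyncItem;

typedef struct SyncQueue
{
    pthread_mutex_t lock;
    pthread_cond_t wakeup;
    SyncItem   *head;
    SyncItem   *tail;
    int         closed;         /* no more paths will be queued */
    int         failures;       /* paths that could not be synced */
    pthread_t   worker;
} SyncQueue;

/* Worker: pop one path at a time and fsync it until the queue is drained. */
static void *
sync_worker(void *arg)
{
    SyncQueue  *q = arg;

    for (;;)
    {
        SyncItem   *item;
        int         fd;
        int         ok;

        pthread_mutex_lock(&q->lock);
        while (q->head == NULL && !q->closed)
            pthread_cond_wait(&q->wakeup, &q->lock);
        item = q->head;
        if (item != NULL && (q->head = item->next) == NULL)
            q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
        if (item == NULL)
            return NULL;        /* queue closed and fully drained */

        fd = open(item->path, O_RDONLY);
        ok = (fd >= 0 && fsync(fd) == 0);
        if (fd >= 0)
            close(fd);
        if (!ok)
        {
            pthread_mutex_lock(&q->lock);
            q->failures++;
            pthread_mutex_unlock(&q->lock);
        }
        free(item->path);
        free(item);
    }
}

/* Start the worker thread; returns 0 on success. */
static int
sync_queue_start(SyncQueue *q)
{
    memset(q, 0, sizeof(*q));
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->wakeup, NULL);
    return pthread_create(&q->worker, NULL, sync_worker, q);
}

/* Called when the main thread has finished writing one file. */
static int
sync_queue_add(SyncQueue *q, const char *path)
{
    SyncItem   *item;
    size_t      len;

    assert(q != NULL && path != NULL);
    len = strlen(path) + 1;
    item = malloc(sizeof(*item));
    if (item == NULL || (item->path = malloc(len)) == NULL)
    {
        free(item);
        return -1;
    }
    memcpy(item->path, path, len);
    item->next = NULL;

    pthread_mutex_lock(&q->lock);
    if (q->tail != NULL)
        q->tail->next = item;
    else
        q->head = item;
    q->tail = item;
    pthread_cond_signal(&q->wakeup);
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/* Wait for all queued paths to be synced; returns the failure count. */
static int
sync_queue_finish(SyncQueue *q)
{
    pthread_mutex_lock(&q->lock);
    q->closed = 1;
    pthread_cond_signal(&q->wakeup);
    pthread_mutex_unlock(&q->lock);
    pthread_join(q->worker, NULL);
    pthread_mutex_destroy(&q->lock);
    pthread_cond_destroy(&q->wakeup);
    return q->failures;
}
```

The main thread would call sync_queue_add() after each file and
sync_queue_finish() at the end, so the flushing overlaps with the remaining copy
work instead of happening in one pass at the end.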

Second one is to use copy_file_range() for the local rewind case, replacing
read()+write(). The patch adds a check for copy_file_range() and defines
HAVE_COPY_FILE_RANGE so that other code can use copy_file_range() if needed.
copy_file_range() was introduced in newer Linux kernels (4.5 and later). On
older Linux kernels or other Unix-like OSes mmap() might be better than
read()+write(), but copy_file_range() is more interesting given that it can skip
the data copy on some file systems. This could be especially beneficial for
Linux file systems on network-based block storage.
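
As an illustration of the approach (again, not the attached patch), a copy loop
built on copy_file_range() with a pread()/pwrite() fallback might look like the
sketch below. The function names copy_range() and copy_range_rw() are
hypothetical. HAVE_COPY_FILE_RANGE is the macro mentioned above; here it is
simply assumed to be set on Linux as a stand-in for the real configure check,
which also assumes a glibc new enough to declare copy_file_range().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for copy_file_range() in glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption for this standalone sketch: stand-in for a configure check. */
#if defined(__linux__) && !defined(HAVE_COPY_FILE_RANGE)
#define HAVE_COPY_FILE_RANGE 1
#endif

/* Fallback: copy len bytes at offset off through a userspace buffer. */
static int
copy_range_rw(int src_fd, int dst_fd, off_t off, size_t len)
{
    char        buf[8192];

    while (len > 0)
    {
        size_t      want = len < sizeof(buf) ? len : sizeof(buf);
        ssize_t     nread = pread(src_fd, buf, want, off);
        ssize_t     done = 0;

        if (nread < 0 && errno == EINTR)
            continue;
        if (nread <= 0)
            return -1;          /* read error or unexpected end of file */

        while (done < nread)
        {
            ssize_t     nw = pwrite(dst_fd, buf + done, nread - done,
                                    off + done);

            if (nw < 0 && errno == EINTR)
                continue;
            if (nw < 0)
                return -1;
            done += nw;
        }
        off += nread;
        len -= (size_t) nread;
    }
    return 0;
}

/* Copy len bytes at offset off from src_fd to the same offset in dst_fd. */
static int
copy_range(int src_fd, int dst_fd, off_t off, size_t len)
{
    assert(src_fd >= 0 && dst_fd >= 0);

#ifdef HAVE_COPY_FILE_RANGE
    {
        off_t       in_off = off;
        off_t       out_off = off;

        while (len > 0)
        {
            ssize_t     n = copy_file_range(src_fd, &in_off,
                                            dst_fd, &out_off, len, 0);

            if (n < 0 && errno == EINTR)
                continue;
            if (n < 0 && (errno == ENOSYS || errno == EXDEV ||
                          errno == EINVAL || errno == EOPNOTSUPP))
                break;          /* not usable here: use the fallback */
            if (n <= 0)
                return -1;      /* hard error or unexpected end of file */
            len -= (size_t) n;
        }
        off = in_off;           /* resume where the kernel copy stopped */
    }
#endif
    return copy_range_rw(src_fd, dst_fd, off, len);
}
```

The fallback keeps the behaviour of read()+write() on systems or file systems
where copy_file_range() is unavailable, so callers need only one entry point.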

Regards,
Paul

Attachment Content-Type Size
0001-Fsync-the-affected-files-directories-only-in-pg_rewi.patch application/octet-stream 8.7 KB
0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch application/octet-stream 4.8 KB
