Re: should frontend tools use syncfs() ?

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Brown <michael(dot)brown(at)discourse(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: should frontend tools use syncfs() ?
Date: 2023-08-16 15:17:05
Message-ID: 20230816151705.GB2651383@nathanxps13
Lists: pgsql-hackers

Thanks for taking a look.

On Wed, Aug 16, 2023 at 08:10:10AM +0900, Michael Paquier wrote:
> On Tue, Aug 08, 2023 at 01:06:06PM -0700, Nathan Bossart wrote:
>> I ran a couple of tests for pg_upgrade with 100k tables (created using the
>> script here [0]) in order to demonstrate the potential benefits of this
>> patch.
>
> That shows some nice numbers with many files, indeed. How does the
> size of each file influence the difference in time?

IME the number of files tends to influence the duration much more than the
size. I assume this is because most files are already sync'd in these code
paths that loop through every file.

> +	else
> +	{
> +		while (errno = 0, (de = readdir(dir)) != NULL)
> +		{
> +			char		subpath[MAXPGPATH * 2];
> +
> +			if (strcmp(de->d_name, ".") == 0 ||
> +				strcmp(de->d_name, "..") == 0)
> +				continue;
>
> It seems to me that there is no need to do that for in-place
> tablespaces. There are relative paths in pg_tblspc/, so they would be
> taken care of by the syncfs() done on the main data folder.
>
> This does not really check if the mount points of each tablespace is
> different, as well. For example, imagine that you have two
> tablespaces within the same disk, syncfs() twice. Perhaps, the
> current logic is OK anyway as long as the behavior is optional, but it
> should be explained in the docs, at least.

True. But I don't know if checking the mount point of each tablespace is
worth the complexity. In the worst case, we'll call syncfs() on the same
file system a few times, which is still likely to be much faster than
syncing every file individually.
FWIW this is what recovery_init_sync_method does today, and I'm not aware
of any complaints about this behavior.

The patch does have the following note:

+ On Linux, <literal>syncfs</literal> may be used instead to ask the
+ operating system to synchronize the whole file systems that contain the
+ data directory, the WAL files, and each tablespace.

Do you think that is sufficient, or do you think we should really clearly
explain that you could end up calling syncfs() on the same file system a
few times if your tablespaces are on the same disk? I personally feel
like that'd be a bit too verbose for the already lengthy descriptions of
this setting.

> I'm finding a bit confusing that fsync_pgdata() is coded in such a way
> that it does a silent fallback to the cascading syncs through
> walkdir() when syncfs is specified but not available in the build.
> Perhaps an error is more helpful because one would then know that they
> are trying something that's not around?

If syncfs() is not available, SYNC_METHOD_SYNCFS won't even be defined, and
parse_sync_method() should fail if "syncfs" is specified. Furthermore, the
relevant part of fsync_pgdata() won't be compiled in whenever HAVE_SYNCFS
is not defined.

> +			pg_log_error("could not synchronize file system for file \"%s\": %m", path);
> +			(void) close(fd);
> +			exit(EXIT_FAILURE);
>
> walkdir() reports errors and does not consider these fatal. Why the
> early exit()?

I know it claims to, but fsync_fname() does exit when fsync() fails:

	returncode = fsync(fd);

	/*
	 * Some OSes don't allow us to fsync directories at all, so we can ignore
	 * those errors. Anything else needs to be reported.
	 */
	if (returncode != 0 && !(isdir && (errno == EBADF || errno == EINVAL)))
	{
		pg_log_error("could not fsync file \"%s\": %m", fname);
		(void) close(fd);
		exit(EXIT_FAILURE);
	}

I suspect that the current code does not treat failures for things like
open() as fatal because it's likely due to a lack of permissions on the
file, but it does treat failures to fsync() as fatal because it is more
likely to indicate that something is very wrong. I don't know whether this
reasoning is sound, but I tried to match the current convention in the
syncfs() code.

> I am a bit concerned about the amount of duplication this patch
> introduces in the docs. Perhaps this had better be moved into a new
> section of the docs to explain the tradeoffs, with each tool linking
> to it?

Yeah, this crossed my mind. Do you know of any existing examples of
options with links to a common section? One problem with this approach is
that there are small differences in the wording for some of the frontend
utilities, so it might be difficult to cleanly unite these sections.

> Do we actually need --no-sync at all if --sync-method is around? We
> could have an extra --sync-method=none at option level with --no-sync
> still around mainly for compatibility? Or perhaps that's just
> over-designing things?

I don't have a strong opinion. We could take up deprecating --no-sync in a
follow-up thread, though. Like you said, we'll probably need to keep it
around for backward compatibility, so it might not be worth the trouble.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
