Re: fdatasync performance problem with large number of DB files

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Paul Guo <guopa(at)vmware(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Brown <michael(dot)brown(at)discourse(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fdatasync performance problem with large number of DB files
Date: 2021-03-19 01:16:19
Message-ID: CA+hUKGJyjpuqYq60=zhoMWL5cCJKR_V9FaJtoZUv6nJCdtkPVA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Mar 19, 2021 at 5:50 AM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> On 2021/03/18 23:03, Bruce Momjian wrote:
> >> Are we sure we want to use the word "crash" here? I don't remember
> >> seeing it used anywhere else in our user interface.
>
> We have GUC restart_after_crash.
>
> > I guess it is
> >> "crash recovery".
> >
> > Maybe call it "recovery_sync_method"
> +1. This name sounds good to me. Or recovery_init_sync_method, because that
> sync happens in the initial phase of recovery.

Yeah, I was trying to fit the existing pattern
{restart,remove_temp_files}_after_crash. But
recovery_init_sync_method also sounds good to me. I prefer the
version with "init"... without "init", people might get the wrong idea
about what this controls, so let's try that. Done in the attached
version.

> Another idea from different angle is data_directory_sync_method. If we adopt
> this, we can easily extend this feature for other use cases (other than sync at
> the beginning of recovery) without changing the name.
> I'm not sure if such cases actually exist, though.

I can't imagine what -- it's like using a sledge hammer to crack a nut.

(I am aware of a semi-related idea: use the proposed fsinfo() Linux
system call to read the filesystem-wide error counter at every
checkpoint to see if anything bad happened that Linux forgot to tell
us about through the usual channels. That's the same internal
mechanism used by syncfs() to report errors, but last I checked it
hadn't been committed yet. I don't think that's share anything with
this code though.)

From your earlier email:

On Fri, Mar 19, 2021 at 2:12 AM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> + database cluster that did not shut down cleanly, including copies
> + created with pg_basebackup.
>
> "pg_basebackup" should be "<application>pg_basebackup</application>"?

Fixed.

> + while ((de = ReadDir(dir, "pg_tblspc")))
>
> The comment of SyncDataDirectory() says "Errors are logged but not
> considered fatal". So ReadDirExtended() with LOG level should be used
> here, instead?

Fixed.

> Isn't it better to call CHECK_FOR_INTERRUPTS() in this loop?

How could this be useful?

> + fd = open(path, O_RDONLY);
>
> For current use, it's not harmful to use open() and close(). But isn't
> it safer to use OpenTransientFile() and CloseTransientFile(), instead?

Ok, done, for consistency with other code.

> Because do_syncfs() may be used for other purpose in the future.

I hope not :-)

> + if (syncfs(fd) < 0)
> + elog(LOG, "syncfs failed for %s: %m", path);
>
> According to the message style guide, this message should be something
> like "could not sync filesystem for \"%s\": %m"?

Fixed.

> I confirmed that no error was reported when crash recovery started with
> syncfs, in old Linux. I should also confirm that no error is reported in that
> case in Linux 5.8+, but I don't have that environement. So I've not tested
> this feature in Linux 5.8+....

I have a Linux 5.10 system. Here's a trace of SyncDataDirectory on a
system that has two tablespaces and has a symlink for pg_wal:

[pid 3861224] lstat("pg_wal", {st_mode=S_IFLNK|0777, st_size=11, ...}) = 0
[pid 3861224] openat(AT_FDCWD, ".", O_RDONLY) = 4
[pid 3861224] syncfs(4) = 0
[pid 3861224] close(4) = 0
[pid 3861224] openat(AT_FDCWD, "pg_tblspc",
O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
[pid 3861224] fstat(4, {st_mode=S_IFDIR|0700, st_size=32, ...}) = 0
[pid 3861224] getdents64(4, 0x55627e18fb60 /* 4 entries */, 32768) = 112
[pid 3861224] openat(AT_FDCWD, "pg_tblspc/16406", O_RDONLY) = 5
[pid 3861224] syncfs(5) = 0
[pid 3861224] close(5) = 0
[pid 3861224] openat(AT_FDCWD, "pg_tblspc/16407", O_RDONLY) = 5
[pid 3861224] syncfs(5) = 0
[pid 3861224] close(5) = 0
[pid 3861224] getdents64(4, 0x55627e18fb60 /* 0 entries */, 32768) = 0
[pid 3861224] close(4) = 0
[pid 3861224] openat(AT_FDCWD, "pg_wal", O_RDONLY) = 4
[pid 3861224] syncfs(4) = 0
[pid 3861224] close(4) = 0

To see how it looks when syncfs() fails, I added a fake implementation
that fails with EUCLEAN on every second call, and then the output
looks like this:

...
1616111356.663 startup 3879272 LOG: database system was interrupted;
last known up at 2021-03-19 12:48:33 NZDT
1616111356.663 startup 3879272 LOG: could not sync filesystem for
"pg_tblspc/16406": Structure needs cleaning
1616111356.663 startup 3879272 LOG: could not sync filesystem for
"pg_wal": Structure needs cleaning
1616111356.663 startup 3879272 LOG: database system was not properly
shut down; automatic recovery in progress
...

A more common setup with no tablespaces and pg_wal as a non-symlink looks like:

[pid 3861448] lstat("pg_wal", {st_mode=S_IFDIR|0700, st_size=316, ...}) = 0
[pid 3861448] openat(AT_FDCWD, ".", O_RDONLY) = 4
[pid 3861448] syncfs(4) = 0
[pid 3861448] close(4) = 0
[pid 3861448] openat(AT_FDCWD, "pg_tblspc",
O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
[pid 3861448] fstat(4, {st_mode=S_IFDIR|0700, st_size=6, ...}) = 0
[pid 3861448] getdents64(4, 0x55764beb0b60 /* 2 entries */, 32768) = 48
[pid 3861448] getdents64(4, 0x55764beb0b60 /* 0 entries */, 32768) = 0
[pid 3861448] close(4)

The alternative fsync() mode is (unsurprisingly) much longer.

Thanks for the reviews!

PS: For illustration/discussion, I've also attached a "none" patch. I
also couldn't resist rebasing my "wal" mode patch, which I plan to
propose for PG15 because there is not enough time left for this
release.

Attachment Content-Type Size
v4-0001-Provide-recovery_init_sync_method-syncfs.patch text/x-patch 10.9 KB
v4-0002-Provide-recovery_init_sync_method-none.patch text/x-patch 3.7 KB
v4-0003-Provide-recovery_init_sync_method-wal.patch text/x-patch 15.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2021-03-19 01:21:29 Re: cleanup temporary files after crash
Previous Message Justin Pryzby 2021-03-19 01:02:32 Re: libpq compression