Re: fdatasync performance problem with large number of DB files

From: Paul Guo <paulguo(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Paul Guo <guopa(at)vmware(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Brown <michael(dot)brown(at)discourse(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fdatasync performance problem with large number of DB files
Date: 2021-03-16 09:44:26
Message-ID: CABQrize6RrhtVUav=kGiv0t98i1xCyqnYBE2BoxBfXM7x3rpHw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 16, 2021 at 4:29 PM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>
>
>
> On 2021/03/16 8:15, Thomas Munro wrote:
> > On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <guopa(at)vmware(dot)com> wrote:
> >> By the way, there is a usual case that we could skip fsync: A fsync-ed already standby generated by pg_rewind/pg_basebackup.
> >> The state of those standbys are surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
> >> pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users to not fsync
> >> during pg_rewind&pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]), so better
> >> let the startup process skip the unnecessary fsync? As to the solution, using guc or writing something in some files like
> >> backup_label(?) does not seem to be good ideas since
> >> 1. Use guc, we still expect fsync after real crash recovery so we need to reset the guc also need to specify pgoptions in pg_ctl command.
> >> 2. Write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might
> >> copy the pgdata directory and then we still need fsync.
> >> The only one simple solution I can think out is to let user touch a file to hint startup, before starting the pg instance.
> >
> > As a thought experiment only, I wonder if there is a way to make your
> > touch-a-special-signal-file scheme more reliable and less dangerous
> > (considering people might copy the signal file around or otherwise
> > screw this up). It seems to me that invalidation is the key, and
> > "unlink the signal file after the first crash recovery" isn't good
> > enough. Hmm What if the file contained a fingerprint containing...
> > let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...

The hostname, MAC address, or pgdata path (or e.g. the inode of a file?) might
stay the same after VM cloning or directory copying, though that is not usual.
I cannot figure out a reliable way to make the information go out of date
after VM/directory cloning or copying, so the simplest approach seems to be
to leave the decision (i.e. touching a file) to users, instead of having
pg_rewind/pg_basebackup write the information automatically.
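
To illustrate the concern, here is a minimal sketch of such a fingerprint
check. It is not taken from any patch: the function names, the field set,
and the sample values are all assumptions made up for the illustration.

```python
import hashlib


def fingerprint(checkpoint_lsn, hostname, mac, pgdata):
    """Hash the identifying fields into one value (hypothetical scheme)."""
    # Join with NUL so that field boundaries cannot be confused.
    payload = "\0".join([checkpoint_lsn, hostname, mac, pgdata])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def may_skip_sync(stored, current):
    """Startup would skip the sync only when the stored fingerprint matches."""
    return stored == current


# Sample values, invented for the illustration.
original = fingerprint("0/3000060", "db1", "00:11:22:33:44:55", "/data/pg")

# A VM clone can keep the hostname, MAC address and path unchanged,
# so the clone computes exactly the same fingerprint.
clone = fingerprint("0/3000060", "db1", "00:11:22:33:44:55", "/data/pg")

# A copy placed under a different path is detected as a mismatch.
moved = fingerprint("0/3000060", "db1", "00:11:22:33:44:55", "/mnt/copy/pg")
```

Here may_skip_sync(original, clone) is true even though nothing guarantees
that the clone's files were synced, while the moved copy is rejected. Adding
more fields narrows the gap but does not obviously close it.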

> > (add more seasoning to taste), and then also some flags to say what is
> > known to be fully fsync'd already: the WAL, pgdata but only as far as
> > changes up to the checkpoint LSN, or all of pgdata? Then you could be
> > conservative for a non-match, but skip the extra work in some common
> > cases like pg_basebackup, as long as you trust the fingerprint scheme
> > not to produce false positives. Or something like that...
> >
> > I'm not too keen to invent clever new schemes for PG14, though. This
> > sync_after_crash=syncfs scheme is pretty simple, and has the advantage
> > that it's very cheap to do it extra redundant times assuming nothing
> > else is creating new dirty kernel pages in serious quantities. Is
> > that useful enough? In particular it avoids the dreaded "open
> > 1,000,000 uncached files over high latency network storage" problem.
> >
> > I don't want to add a hypothetical sync_after_crash=none, because it
> > seems like generally a bad idea. We already have a
> > running-with-scissors mode you could use for that: fsync=off.
>
> I heard that some backup tools sync the database directory when restoring it.
> I guess that those who use such tools might want the option to disable such
> startup sync (i.e., sync_after_crash=none) because it's not necessary.

This scenario seems to support the file-touching solution, since we do not
have an automatic way to skip the fsync. I thought about using
sync_after_crash=none to fix my issue, but as I said, we would need to reset
the GUC afterwards, since we still expect fsync/syncfs after the second crash.
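
A rough sketch of the one-shot behaviour I have in mind follows. The signal
file name and the function names are hypothetical, and this only models the
decision at startup; it does not address the concern raised earlier about
people copying the signal file around.

```python
import tempfile
from pathlib import Path

# Hypothetical file name, chosen only for this illustration.
SIGNAL_FILE = "skip_startup_sync.signal"


def startup_needs_sync(pgdata):
    """Return False once if the user touched the signal file, else True."""
    signal = Path(pgdata) / SIGNAL_FILE
    if signal.exists():
        # Consume the hint so that the next crash recovery syncs again.
        signal.unlink()
        return False
    return True


def demo():
    """Simulate two consecutive startups of the same data directory."""
    with tempfile.TemporaryDirectory() as pgdata:
        # The user touches the file after pg_rewind/pg_basebackup.
        (Path(pgdata) / SIGNAL_FILE).touch()
        first = startup_needs_sync(pgdata)   # hint present: skip the sync
        second = startup_needs_sync(pgdata)  # hint consumed: sync as usual
        return [first, second]
```

Unlike a GUC, the hint resets itself, so nothing has to be changed back
before the next startup.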

> They can skip that sync by fsync=off. But if they just want to skip only that
> startup sync and make subsequent recovery (or standby server) work with
> fsync=on, they would need to shutdown the server after that startup sync
> finishes, enable fsync, and restart the server. In this case, since the server
> is restarted with the state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
> would not be performed. This procedure is tricky. So IMO supporting

This seems to make the process complex. From a product design perspective,
it does not seem attractive.

> sync_after_crash=none would be helpful for this case and simple.

Regards,
Paul Guo (Vmware)
