Re: fdatasync performance problem with large number of DB files

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Paul Guo <guopa(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Brown <michael(dot)brown(at)discourse(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fdatasync performance problem with large number of DB files
Date: 2021-03-16 08:29:05
Message-ID: 76de0e61-a553-6003-aeec-cb35ada791cf@oss.nttdata.com
Lists: pgsql-hackers

On 2021/03/16 8:15, Thomas Munro wrote:
> On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <guopa(at)vmware(dot)com> wrote:
>> By the way, there is a usual case that we could skip fsync: A fsync-ed already standby generated by pg_rewind/pg_basebackup.
>> The state of those standbys are surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
>> pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users to not fsync
>> during pg_rewind&pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]), so better
>> let the startup process skip the unnecessary fsync? As to the solution, using guc or writing something in some files like
>> backup_label(?) does not seem to be good ideas since
>> 1. Use guc, we still expect fsync after real crash recovery so we need to reset the guc also need to specify pgoptions in pg_ctl command.
>> 2. Write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might
>> copy the pgdata directory and then we still need fsync.
>> The only one simple solution I can think out is to let user touch a file to hint startup, before starting the pg instance.
>
> As a thought experiment only, I wonder if there is a way to make your
> touch-a-special-signal-file scheme more reliable and less dangerous
> (considering people might copy the signal file around or otherwise
> screw this up). It seems to me that invalidation is the key, and
> "unlink the signal file after the first crash recovery" isn't good
> enough. Hmm. What if the file contained a fingerprint containing...
> let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...
> (add more seasoning to taste), and then also some flags to say what is
> known to be fully fsync'd already: the WAL, pgdata but only as far as
> changes up to the checkpoint LSN, or all of pgdata? Then you could be
> conservative for a non-match, but skip the extra work in some common
> cases like pg_basebackup, as long as you trust the fingerprint scheme
> not to produce false positives. Or something like that...
>
> I'm not too keen to invent clever new schemes for PG14, though. This
> sync_after_crash=syncfs scheme is pretty simple, and has the advantage
> that it's very cheap to do it extra redundant times assuming nothing
> else is creating new dirty kernel pages in serious quantities. Is
> that useful enough? In particular it avoids the dreaded "open
> 1,000,000 uncached files over high latency network storage" problem.
>
> I don't want to add a hypothetical sync_after_crash=none, because it
> seems like generally a bad idea. We already have a
> running-with-scissors mode you could use for that: fsync=off.

I have heard that some backup tools already sync the database directory when
restoring it. I guess that users of such tools might want an option to disable
the startup sync (i.e., sync_after_crash=none), because it is redundant for them.

They can skip that sync by setting fsync=off. But if they want to skip only
the startup sync and have the subsequent recovery (or standby server) run with
fsync=on, they would need to shut down the server once the startup sync phase
has finished, enable fsync, and restart the server. In this case, since the
server is restarted with state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
is not performed. This procedure is tricky. So IMO supporting
sync_after_crash=none would be helpful for this case, and simple.
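
To make the comparison concrete, here is a rough sketch in Python (not
PostgreSQL code) of the startup decision as described in this thread. The
function name, the enum class, and the string values are assumptions made up
for illustration; in particular "none" is only the value proposed here, not
an existing setting.

```python
from enum import Enum


class DBState(Enum):
    # Control-file states named after the ones discussed in this thread.
    SHUTDOWNED = "DB_SHUTDOWNED"
    SHUTDOWNED_IN_RECOVERY = "DB_SHUTDOWNED_IN_RECOVERY"
    IN_ARCHIVE_RECOVERY = "DB_IN_ARCHIVE_RECOVERY"
    IN_PRODUCTION = "DB_IN_PRODUCTION"


# States that indicate a clean shutdown, so no startup sync is performed.
CLEAN_STATES = {DBState.SHUTDOWNED, DBState.SHUTDOWNED_IN_RECOVERY}


def startup_sync_needed(state, fsync_enabled, sync_after_crash="fsync"):
    """Hypothetical helper: would startup sync the data directory?"""
    if not fsync_enabled:
        # fsync=off skips every sync, including the startup one.
        return False
    if state in CLEAN_STATES:
        # Cleanly shut down: nothing to sync at startup.
        return False
    # "none" is the proposed (hypothetical) value that skips only this sync.
    return sync_after_crash != "none"


# Today's workaround: two starts.
# 1) Start the restored directory with fsync=off; startup sync is skipped.
first = startup_sync_needed(DBState.IN_ARCHIVE_RECOVERY, fsync_enabled=False)
# 2) Shut down, enable fsync, restart; the state is now clean, so no sync.
second = startup_sync_needed(DBState.SHUTDOWNED_IN_RECOVERY, fsync_enabled=True)

# With the proposed option: one start, fsync stays on the whole time.
proposed = startup_sync_needed(
    DBState.IN_ARCHIVE_RECOVERY, fsync_enabled=True, sync_after_crash="none"
)
```

In this sketch all three calls return False, but only the last one reaches
that result in a single start with fsync=on throughout.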

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
