From: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Paul Guo <guopa(at)vmware(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Brown <michael(dot)brown(at)discourse(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: fdatasync performance problem with large number of DB files |
Date: | 2021-03-16 08:29:05 |
Message-ID: | 76de0e61-a553-6003-aeec-cb35ada791cf@oss.nttdata.com |
Lists: | pgsql-hackers |
On 2021/03/16 8:15, Thomas Munro wrote:
> On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <guopa(at)vmware(dot)com> wrote:
>> By the way, there is a common case where we could skip the fsync: a standby generated by pg_rewind/pg_basebackup that has already been fsync-ed.
>> The state of those standbys is surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
>> pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users not to fsync
>> during pg_rewind & pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]), so it would be better
>> to let the startup process skip the unnecessary fsync. As to the solution, using a GUC or writing something in some files like
>> backup_label(?) does not seem to be a good idea, since
>> 1. With a GUC, we still expect fsync after real crash recovery, so we would need to reset the GUC and also specify pgoptions in the pg_ctl command.
>> 2. We could write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might
>> copy the pgdata directory, and then we still need fsync.
>> The only simple solution I can think of is to let the user touch a file to hint the startup process, before starting the pg instance.
>
> As a thought experiment only, I wonder if there is a way to make your
> touch-a-special-signal-file scheme more reliable and less dangerous
> (considering people might copy the signal file around or otherwise
> screw this up). It seems to me that invalidation is the key, and
> "unlink the signal file after the first crash recovery" isn't good
> enough. Hmm. What if the file contained a fingerprint containing...
> let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...
> (add more seasoning to taste), and then also some flags to say what is
> known to be fully fsync'd already: the WAL, pgdata but only as far as
> changes up to the checkpoint LSN, or all of pgdata? Then you could be
> conservative for a non-match, but skip the extra work in some common
> cases like pg_basebackup, as long as you trust the fingerprint scheme
> not to produce false positives. Or something like that...
>
> I'm not too keen to invent clever new schemes for PG14, though. This
> sync_after_crash=syncfs scheme is pretty simple, and has the advantage
> that it's very cheap to do it extra redundant times assuming nothing
> else is creating new dirty kernel pages in serious quantities. Is
> that useful enough? In particular it avoids the dreaded "open
> 1,000,000 uncached files over high latency network storage" problem.
>
> I don't want to add a hypothetical sync_after_crash=none, because it
> seems like generally a bad idea. We already have a
> running-with-scissors mode you could use for that: fsync=off.

I heard that some backup tools sync the database directory when restoring it.
I guess that those who use such tools might want an option to disable the
startup sync (i.e., sync_after_crash=none) because it's unnecessary for them.
They can skip that sync by setting fsync=off. But if they want to skip only the
startup sync and have subsequent recovery (or the standby server) run with
fsync=on, they would need to shut down the server after that startup sync
finishes, enable fsync, and restart the server. In this case, since the server
is restarted with state=DB_SHUTDOWNED_IN_RECOVERY, the startup sync
would not be performed. This procedure is tricky. So IMO supporting
sync_after_crash=none would be helpful for this case, and simple.
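
To make the states involved concrete, here is a minimal sketch in Python of the decision described above. It is only a simplified model under my own assumptions, not the PostgreSQL source: `DBState` and `startup_sync_action` are made-up names, `OTHER` stands in for every state that is not a clean shutdown, and the "none" value is just the option being proposed in this thread.

```python
from enum import Enum, auto


class DBState(Enum):
    # The two clean-shutdown states named in this thread, plus a stand-in
    # for everything else (e.g. a crashed or freshly restored cluster).
    SHUTDOWNED = auto()
    SHUTDOWNED_IN_RECOVERY = auto()
    OTHER = auto()


def startup_sync_action(state, fsync_enabled, sync_after_crash="fsync"):
    """Return which startup sync this model would run: 'fsync', 'syncfs' or 'none'.

    sync_after_crash="none" is hypothetical: it is the proposal, not an
    existing setting.
    """
    if sync_after_crash not in ("fsync", "syncfs", "none"):
        raise ValueError(f"unknown sync_after_crash value: {sync_after_crash!r}")
    # After a clean shutdown there is no startup sync at all.
    if state in (DBState.SHUTDOWNED, DBState.SHUTDOWNED_IN_RECOVERY):
        return "none"
    # fsync=off disables the startup sync together with all other syncs.
    if not fsync_enabled:
        return "none"
    return sync_after_crash


# Today's workaround: two starts with a shutdown in between.
first_start = startup_sync_action(DBState.OTHER, fsync_enabled=False)
second_start = startup_sync_action(DBState.SHUTDOWNED_IN_RECOVERY, fsync_enabled=True)

# With the proposed option: a single start, fsync=on from the beginning.
single_start = startup_sync_action(DBState.OTHER, fsync_enabled=True,
                                   sync_after_crash="none")
```

In this model all three calls skip the startup sync, but only the last one does so in a single start with fsync enabled throughout, which is the simplification I'm arguing for.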
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION