Re: fdatasync performance problem with large number of DB files

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Paul Guo <guopa(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Brown <michael(dot)brown(at)discourse(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fdatasync performance problem with large number of DB files
Date: 2021-03-15 23:15:05
Message-ID: CA+hUKGJi11x6cHJQNeDnLRvk1GMX18DJQaKFqwCOjgeFnbiVKA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 16, 2021 at 3:30 AM Paul Guo <guopa(at)vmware(dot)com> wrote:
> By the way, there is a usual case that we could skip fsync: A fsync-ed already standby generated by pg_rewind/pg_basebackup.
> The state of those standbys are surely not DB_SHUTDOWNED/DB_SHUTDOWNED_IN_RECOVERY, so the
> pgdata directory is fsync-ed again during startup when starting those pg instances. We could ask users to not fsync
> during pg_rewind&pg_basebackup, but we probably want to just fsync some files in pg_rewind (see [1]), so better
> let the startup process skip the unnecessary fsync? As to the solution, using guc or writing something in some files like
> backup_label(?) does not seem to be good ideas since
> 1. Use guc, we still expect fsync after real crash recovery so we need to reset the guc also need to specify pgoptions in pg_ctl command.
> 2. Write some hint information to files like backup_label(?) in pg_rewind/pg_basebackup, but people might
> copy the pgdata directory and then we still need fsync.
> The only one simple solution I can think out is to let user touch a file to hint startup, before starting the pg instance.

As a thought experiment only, I wonder if there is a way to make your
touch-a-special-signal-file scheme more reliable and less dangerous
(considering people might copy the signal file around or otherwise
screw this up). It seems to me that invalidation is the key, and
"unlink the signal file after the first crash recovery" isn't good
enough. Hmm. What if the file contained a fingerprint containing...
let's see... checkpoint LSN, hostname, MAC address, pgdata path, ...
(add more seasoning to taste), and then also some flags to say what is
known to be fully fsync'd already: the WAL, pgdata but only as far as
changes up to the checkpoint LSN, or all of pgdata? Then you could be
conservative for a non-match, but skip the extra work in some common
cases like pg_basebackup, as long as you trust the fingerprint scheme
not to produce false positives. Or something like that...
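
To make the thought experiment concrete, here is a minimal sketch of the "be conservative for a non-match" rule. Everything in it is hypothetical: the SyncHint struct, the flag names, and trusted_sync_flags() are invented for illustration and are not part of any patch. It compares only the checkpoint LSN, hostname, and pgdata path (the MAC address and any further "seasoning" are left out), and it assumes the strings in the hint are NUL-terminated.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical flags: what the tool that wrote the hint claims is durable. */
#define SYNCED_WAL           0x01   /* WAL is fully fsync'd */
#define SYNCED_DATA_TO_CKPT  0x02   /* pgdata changes up to the checkpoint LSN */
#define SYNCED_ALL_DATA      0x04   /* all of pgdata */

/* Hypothetical contents of the signal file. */
typedef struct SyncHint
{
    uint64_t    checkpoint_lsn;
    char        hostname[256];
    char        pgdata_path[1024];
    uint32_t    synced_flags;
} SyncHint;

/*
 * Return the flags we are willing to trust.  Any mismatch (or a missing
 * hint) yields 0, meaning "trust nothing, sync everything as usual".
 */
static uint32_t
trusted_sync_flags(const SyncHint *hint, uint64_t cur_checkpoint_lsn,
                   const char *cur_hostname, const char *cur_pgdata_path)
{
    if (hint == NULL)
        return 0;
    if (hint->checkpoint_lsn != cur_checkpoint_lsn)
        return 0;
    if (strcmp(hint->hostname, cur_hostname) != 0)
        return 0;
    if (strcmp(hint->pgdata_path, cur_pgdata_path) != 0)
        return 0;
    return hint->synced_flags;
}
```

With a rule like this, a copied pgdata directory (different path or host) or a cluster that has moved past the recorded checkpoint simply fails the match and falls back to the full sync.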

I'm not too keen to invent clever new schemes for PG14, though. This
sync_after_crash=syncfs scheme is pretty simple, and has the advantage
that it's very cheap to do it extra redundant times assuming nothing
else is creating new dirty kernel pages in serious quantities. Is
that useful enough? In particular it avoids the dreaded "open
1,000,000 uncached files over high latency network storage" problem.
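
For illustration, a rough sketch of the underlying idea on Linux: one syncfs() call on a descriptor for the data directory flushes the whole filesystem containing it, so no per-file open() is needed. The helper name sync_filesystem_of() is made up for this example and this is not the actual patch code; a real implementation would presumably also have to cover tablespaces and a WAL directory that live on other filesystems.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* syncfs() is a Linux/glibc extension */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Declared explicitly in case an earlier include fixed the feature macros. */
extern int syncfs(int fd);

/*
 * Flush all dirty data of the filesystem that contains 'path'.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int
sync_filesystem_of(const char *path)
{
    int     fd;
    int     rc;
    int     saved_errno;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    rc = syncfs(fd);            /* one call, regardless of file count */
    saved_errno = errno;
    close(fd);
    errno = saved_errno;        /* report syncfs()'s error, not close()'s */

    return rc;
}
```

Calling sync_filesystem_of() on the pgdata path a second time costs little if nothing has dirtied more pages in between, which is the "cheap to repeat" property described above.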

I don't want to add a hypothetical sync_after_crash=none, because it
seems like generally a bad idea. We already have a
running-with-scissors mode you could use for that: fsync=off.
