Re: fdatasync performance problem with large number of DB files

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Michael Brown <michael(dot)brown(at)discourse(dot)org>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fdatasync performance problem with large number of DB files
Date: 2021-03-10 22:38:07
Message-ID: CA+hUKGLt7zoFcJemmXzL=ZB3Rdse6Ds_Mm05G5Ogai_yZRp2=A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 11, 2021 at 11:01 AM Michael Brown
<michael(dot)brown(at)discourse(dot)org> wrote:
> * pg_basebackup receives a streaming backup (via [2] fsync_dir_recurse
> or fsync_pgdata) unless --no-sync is specified
> * postgres starts up unclean (via [3] SyncDataDirectory)
>
> We run multiple postgres clusters and some of those clusters have many
> (~450) databases (one database-per-customer) meaning that the postgres
> data directory has around 700,000 files.
>
> On one of our less loaded servers this takes ~7 minutes to complete, but
> on another [4] this takes ~90 minutes.

Ouch.

> My questions are:
>
> * is there a knob missing we can configure?
> * can we get an opt-in knob to use a single sync() call instead of a
> recursive fsync()?
> * would you be open to merging a patch providing said knob?
> * is there something else we missed?

As discussed on that other thread, I don't think sync() is an option
(the standard doesn't require it to wait for completion, it doesn't
wait on all OSes, and it doesn't report errors). syncfs() on Linux
5.8+ looks like a good candidate, though, and I think we'd consider a
patch like that. I mean, I even posted
one[1] in that other thread. There will of course be cases where
that's slower (small database sharing filesystem with other software
that has a lot of dirty data to write back).

I also wrote a WAL-and-checkpoint based prototype[2], which, among
other advantages such as being faster, not ignoring errors and not
triggering collateral write-back storms, happens to work on all
operating systems. On the other hand it requires a somewhat dogmatic
switch in thinking about the meaning of checkpoints (I mean, it
requires humans to promise not to falsify checkpoints by copying
databases around underneath us), which may be hard to sell (I didn't
try very hard), and there may be subtleties I have missed...

[1] https://www.postgresql.org/message-id/CA%2BhUKGKT6XiPiEJrqeOFGi7RYCGzbBysF9pyWwv0-jm-oNajxg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CA%2BhUKGKHhDNnN6fxf6qrAx9h%2BmjdNU2Zmx7ztJzFQ0C5%3Du3QPg%40mail.gmail.com
