Use of fsync; was Re: Pg_upgrade speed for many tables

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Use of fsync; was Re: Pg_upgrade speed for many tables
Date: 2012-11-24 03:22:00
Message-ID: 20121124032200.GB9382@momjian.us
Lists: pgsql-hackers

On Mon, Nov 19, 2012 at 12:11:26PM -0800, Jeff Janes wrote:

[ Sorry for the delay in replying.]

> On Wed, Nov 14, 2012 at 3:55 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > On Mon, Nov 12, 2012 at 10:29:39AM -0800, Jeff Janes wrote:
> >>
> >> Is turning off synchronous_commit enough? What about turning off fsync?
> >
> > I did some testing with the attached patch on a magnetic disk with no
> > BBU that turns off fsync;
>
> With which file system? I wouldn't expect you to see a benefit with
> ext2 or ext3, it seems to be a peculiarity of ext4 that inhibits
> "group fsync" of new file creations but rather does each one serially.
> Whether it is worth applying a fix that is only needed for that one
> file system, I don't know. The trade-offs are not all that clear to
> me yet.

It seems possible that only ext4 shows the difference.

> > I got these results
> >
> > tables   sync_com=off   fsync=off
> >      1          15.90       13.51
> >   1000          26.09       24.56
> >   2000          33.41       31.20
> >   4000          57.39       57.74
> >   8000         102.84      116.28
> >  16000         189.43      207.84
> >
> > It shows fsync=off faster for < 4k, and slower for > 4k. Not sure why this
> > is the case, but perhaps the buffering of the fsync is actually faster
> > than doing a no-op fsync.
>
> synchronous_commit=off turns off not only the fsync at each commit,
> but also the write-to-kernel at each commit; so it is not surprising
> that it is faster at large scale. I would specify both
> synchronous_commit=off and fsync=off.

I would like to see actual numbers showing synchronous_commit=off is
also useful if we use fsync=off.
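
As a rough aid for reading the table quoted earlier in this message, the
sketch below computes how much slower or faster the fsync=off column is
relative to the sync_com=off column. The numbers are copied from the table;
treating the first column as the table count, and the names RESULTS,
percent_change and faster_with_fsync_off, are assumptions made for
illustration only.

```python
# Timings copied from the quoted table: table count -> (sync_com=off, fsync=off).
# Assumption: the first column of the table is the number of tables.
RESULTS = {
    1: (15.90, 13.51),
    1000: (26.09, 24.56),
    2000: (33.41, 31.20),
    4000: (57.39, 57.74),
    8000: (102.84, 116.28),
    16000: (189.43, 207.84),
}


def percent_change(results):
    """Return {tables: percent change of fsync=off relative to sync_com=off}.

    A negative value means the fsync=off run was faster.
    """
    return {n: (fs - sc) / sc * 100.0 for n, (sc, fs) in results.items()}


def faster_with_fsync_off(results):
    """Return the table counts where the fsync=off run was faster."""
    return [n for n, (sc, fs) in sorted(results.items()) if fs < sc]


if __name__ == "__main__":
    for n, pct in sorted(percent_change(RESULTS).items()):
        print(f"{n:>6} tables: fsync=off differs by {pct:+.1f}%")
    print("fsync=off faster at:", faster_with_fsync_off(RESULTS))
```

This only restates the posted measurements; it does not cover the combined
synchronous_commit=off plus fsync=off case, which has not been measured here.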

> >> When I'm doing a pg_upgrade with thousands of tables, the shutdown
> >> checkpoint after restoring the dump to the new cluster takes a very
> >> long time, as the writer drains its operation table by opening and
> >> individually fsync-ing thousands of files. This takes about 40 ms per
> >> file, which I assume is a combination of slow lap-top disk drive, and
> >> a strange deal with ext4 which makes fsyncing a recently created file
> >> very slow. But even with faster hdd, this would still be a problem
> >> if it works the same way, with every file needing 4 rotations to be
> >> fsynced and this happens in serial.
> >
> > Is this with the current code that does synchronous_commit=off? If not,
> > can you test to see if this is still a problem?
>
> Yes, it is with synchronous_commit=off. (or if it wasn't originally,
> it is now, with the same result)
>
> Applying your fsync patch does solve the problem for me on ext4.
> Having the new cluster be on ext3 rather than ext4 also solves the
> problem, without the need for a patch; but it would be nice to be more
> friendly to ext4, which is popular even though not recommended.

Do you have numbers with synchronous_commit=off, fsync=off, and both, on
ext4?
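
One way to get a feel for the per-file cost outside of Postgres is a small
timing sketch like the one below. It is not pg_upgrade or backend code; it
only imitates the pattern described in the quoted text (create many new
files, then open and fsync each one serially). The function name, the file
names, and the 8192-byte file size are assumptions for illustration.

```python
import os
import tempfile
import time


def time_serial_fsync(directory, count, size=8192):
    """Create `count` new files in `directory`, then fsync each one in turn.

    Returns (create_seconds, fsync_seconds).
    """
    payload = b"\0" * size
    paths = []

    # Phase 1: create the files without forcing them to disk.
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, f"rel_{i}")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(payload)
        paths.append(path)
    create_seconds = time.perf_counter() - start

    # Phase 2: reopen and fsync each file, one after another.
    start = time.perf_counter()
    for path in paths:
        fd = os.open(path, os.O_RDWR)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
    fsync_seconds = time.perf_counter() - start

    return create_seconds, fsync_seconds


if __name__ == "__main__":
    # Run this in a directory on the file system under test (ext3, ext4, ...).
    with tempfile.TemporaryDirectory(dir=".") as d:
        created, synced = time_serial_fsync(d, 1000)
        print(f"create: {created:.2f}s  fsync: {synced:.2f}s "
              f"({synced / 1000 * 1000:.1f} ms per file)")
```

Running it once on ext3 and once on ext4 would give a comparison that is
independent of the server settings, though it says nothing about
synchronous_commit.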

> >> Anyway, the reason I think turning fsync off might be reasonable is
> >> that as soon as the new cluster is shut down, pg_upgrade starts
> >> overwriting most of those just-fsynced files with other files from the
> >> old cluster, and AFAICT makes no effort to fsync them. So until there
> >> is a system-wide sync after the pg_upgrade finishes, your new cluster
> >> is already in mortal danger anyway.
> >
> > pg_upgrade does a cluster shutdown before overwriting those files.
>
> Right. So as far as the cluster is concerned, those files have been
> fsynced. But then the next step is to go behind the cluster's back and
> replace those fsynced files with different files, which may or may not
> have been fsynced. This is what makes me think the new cluster is in
> mortal danger. Not only have the new files perhaps not been fsynced,
> but the cluster is not even aware of this fact, so you can start it
> up, and then shut it down, and it still won't bother to fsync them,
> because as far as it is concerned they already have been.
>
> Given that, how much extra danger would be added by having the new
> cluster schema restore run with fsync=off?
>
> In any event, I think the documentation should caution that the
> upgrade should not be deemed to be a success until after a system-wide
> sync has been done. Even if we use the link rather than copy method,
> are we sure that that is safe if the directories recording those links
> have not been fsynced?

OK, the above is something I have been thinking about, and obviously you
have too. If you change fsync from off to on in a cluster, and restart
it, there is no guarantee that the dirty pages you read from the kernel
are actually on disk, because Postgres doesn't know they are dirty.
They probably will be pushed to disk by the kernel in less than one
minute, but still, it doesn't seem reliable. Should this be documented
in the fsync section?

Again, another reason not to use fsync=off, though your example of the
file copy is a good one. As you stated, this is a problem with the file
copy/link, independent of how Postgres handles the files. We can tell
people to use 'sync' as root on Unix, but what about Windows?
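
As an alternative to a system-wide 'sync', a narrower approach on Unix
would be to fsync every file and every directory under the new data
directory once the copy/link step is done. The sketch below is only an
illustration of that idea, not something pg_upgrade does; the function name
fsync_tree is hypothetical, and it assumes a POSIX system where a directory
can be opened read-only and passed to fsync. It does not answer the Windows
question.

```python
import os


def fsync_tree(root):
    """fsync every regular file and every directory under `root`.

    Returns the number of file descriptors that were fsynced.
    Assumption: POSIX semantics (directories can be opened and fsynced).
    """
    synced = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            fd = os.open(path, os.O_RDONLY)
            try:
                os.fsync(fd)
                synced += 1
            finally:
                os.close(fd)

        # fsync the directory itself so that new entries (copied files or
        # hard links) are durable, not just the file contents.
        fd = os.open(dirpath, os.O_RDONLY)
        try:
            os.fsync(fd)
            synced += 1
        finally:
            os.close(fd)
    return synced


if __name__ == "__main__":
    import sys
    print(fsync_tree(sys.argv[1]), "files and directories fsynced")
```

This walks the tree serially, so on a file system where fsync of a recently
created file is slow it would pay the same per-file cost discussed above.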

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
