Re: Use of fsync; was Re: Pg_upgrade speed for many tables

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: Use of fsync; was Re: Pg_upgrade speed for many tables
Date: 2012-12-04 03:50:32
Message-ID: 20121204035032.GA30893@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Applied.

---------------------------------------------------------------------------

On Fri, Nov 30, 2012 at 10:43:29PM -0500, Bruce Momjian wrote:
> On Mon, Nov 26, 2012 at 02:43:19PM -0500, Bruce Momjian wrote:
> > > >> In any event, I think the documentation should caution that the
> > > >> upgrade should not be deemed to be a success until after a system-wide
> > > >> sync has been done. Even if we use the link rather than copy method,
> > > >> are we sure that that is safe if the directories recording those links
> > > >> have not been fsynced?
> > > >
> > > > OK, the above is something I have been thinking about, and obviously you
> > > > have too. If you change fsync from off to on in a cluster, and restart
> > > > it, there is no guarantee that the dirty pages you read from the kernel
> > > > are actually on disk, because Postgres doesn't know they are dirty.
> > > > They probably will be pushed to disk by the kernel in less than one
> > > > minute, but still, it doesn't seem reliable. Should this be documented
> > > > in the fsync section?
> > > >
> > > > Again, another reason not to use fsync=off, though your example of the
> > > > file copy is a good one. As you stated, this is a problem with the file
> > > > copy/link, independent of how Postgres handles the files. We can tell
> > > > people to use 'sync' as root on Unix, but what about Windows?
> > >
> > > I'm pretty sure someone mentioned the way to do that on Windows in
> > > this list in the last few months, but I can't seem to find it. I
> > > thought it was the initdb fsync thread.
> >
> > Yep, the code is already in initdb to fsync a directory --- we just need
> > a way for pg_upgrade to access it.
>
> I have developed the attached patch that does this. It basically adds
> an --sync-only option to initdb, then turns off all durability in
> pg_upgrade and has pg_upgrade run initdb --sync-only; this give us
> another nice speedup!
>
> ------ SSD ---- -- magnetic ---
> git patch git patch
> 1 11.11 11.11 11.10 11.13
> 1000 20.57 19.89 20.72 19.30
> 2000 28.02 25.81 28.50 27.53
> 4000 42.00 43.59 46.71 46.84
> 8000 89.66 74.16 89.10 73.67
> 16000 157.66 135.98 159.97 153.48
> 32000 316.24 296.90 334.74 308.59
> 64000 814.97 715.53 797.34 727.94
>
> (I am very happy with these times. Thanks to Jeff Janes for his
> suggestions.)
>
> I have also added documentation to the 'fsync' configuration variable
> warning about dirty buffers and recommending flushing them to disk
> before the cluster is crash-recovery safe.
>
> I consider this patch ready for 9.3 application (meaning it is not a
> prototype).
>
> --
> Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
> EnterpriseDB http://enterprisedb.com
>
> + It's impossible for everything to be true. +

> diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
> new file mode 100644
> index c12f15b..63df529
> *** a/contrib/pg_upgrade/pg_upgrade.c
> --- b/contrib/pg_upgrade/pg_upgrade.c
> *************** main(int argc, char **argv)
> *** 150,155 ****
> --- 150,161 ----
> new_cluster.pgdata);
> check_ok();
>
> + prep_status("Sync data directory to disk");
> + exec_prog(UTILITY_LOG_FILE, NULL, true,
> + "\"%s/initdb\" --sync-only \"%s\"", new_cluster.bindir,
> + new_cluster.pgdata);
> + check_ok();
> +
> create_script_for_cluster_analyze(&analyze_script_file_name);
> create_script_for_old_cluster_deletion(&deletion_script_file_name);
>
> diff --git a/contrib/pg_upgrade/server.c b/contrib/pg_upgrade/server.c
> new file mode 100644
> index 49d4c8f..05d8cc0
> *** a/contrib/pg_upgrade/server.c
> --- b/contrib/pg_upgrade/server.c
> *************** start_postmaster(ClusterInfo *cluster)
> *** 209,217 ****
> * a gap of 2000000000 from the current xid counter, so autovacuum will
> * not touch them.
> *
> ! * synchronous_commit=off improves object creation speed, and we only
> ! * modify the new cluster, so only use it there. If there is a crash,
> ! * the new cluster has to be recreated anyway.
> */
> snprintf(cmd, sizeof(cmd),
> "\"%s/pg_ctl\" -w -l \"%s\" -D \"%s\" -o \"-p %d%s%s%s%s\" start",
> --- 209,217 ----
> * a gap of 2000000000 from the current xid counter, so autovacuum will
> * not touch them.
> *
> ! * Turn off durability requirements to improve object creation speed, and
> ! * we only modify the new cluster, so only use it there. If there is a
> ! * crash, the new cluster has to be recreated anyway.
> */
> snprintf(cmd, sizeof(cmd),
> "\"%s/pg_ctl\" -w -l \"%s\" -D \"%s\" -o \"-p %d%s%s%s%s\" start",
> *************** start_postmaster(ClusterInfo *cluster)
> *** 219,225 ****
> (cluster->controldata.cat_ver >=
> BINARY_UPGRADE_SERVER_FLAG_CAT_VER) ? " -b" :
> " -c autovacuum=off -c autovacuum_freeze_max_age=2000000000",
> ! (cluster == &new_cluster) ? " -c synchronous_commit=off" : "",
> cluster->pgopts ? cluster->pgopts : "", socket_string);
>
> /*
> --- 219,226 ----
> (cluster->controldata.cat_ver >=
> BINARY_UPGRADE_SERVER_FLAG_CAT_VER) ? " -b" :
> " -c autovacuum=off -c autovacuum_freeze_max_age=2000000000",
> ! (cluster == &new_cluster) ?
> ! " -c synchronous_commit=off -c fsync=off -c full_page_writes=off" : "",
> cluster->pgopts ? cluster->pgopts : "", socket_string);
>
> /*
> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> new file mode 100644
> index b56070b..b7df8ce
> *** a/doc/src/sgml/config.sgml
> --- b/doc/src/sgml/config.sgml
> *************** include 'filename'
> *** 1697,1702 ****
> --- 1697,1711 ----
> </para>
>
> <para>
> + For reliable recovery when changing <varname>fsync</varname>
> + off to on, it is necessary to force all modified buffers in the
> + kernel to durable storage. This can be done while the cluster
> + is shutdown or while fsync is on by running <command>initdb
> + --sync-only</command>, running <command>sync</>, unmounting the
> + file system, or rebooting the server.
> + </para>
> +
> + <para>
> In many situations, turning off <xref linkend="guc-synchronous-commit">
> for noncritical transactions can provide much of the potential
> performance benefit of turning off <varname>fsync</varname>, without
> diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
> new file mode 100644
> index 08ee37e..a1e46eb
> *** a/doc/src/sgml/ref/initdb.sgml
> --- b/doc/src/sgml/ref/initdb.sgml
> *************** PostgreSQL documentation
> *** 245,250 ****
> --- 245,261 ----
> </varlistentry>
>
> <varlistentry>
> + <term><option>-S</option></term>
> + <term><option>--sync-only</option></term>
> + <listitem>
> + <para>
> + Safely write all database files to disk and exit. This does not
> + perform any of the normal <application>initdb</> operations.
> + </para>
> + </listitem>
> + </varlistentry>
> +
> + <varlistentry>
> <term><option>-T <replaceable>CFG</></option></term>
> <term><option>--text-search-config=<replaceable>CFG</></option></term>
> <listitem>
> diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
> new file mode 100644
> index 402504b..8c0a9f4
> *** a/src/bin/initdb/initdb.c
> --- b/src/bin/initdb/initdb.c
> *************** static const char *authmethodlocal = "";
> *** 118,123 ****
> --- 118,124 ----
> static bool debug = false;
> static bool noclean = false;
> static bool do_sync = true;
> + static bool sync_only = false;
> static bool show_setting = false;
> static char *xlog_dir = "";
>
> *************** usage(const char *progname)
> *** 2796,2801 ****
> --- 2797,2803 ----
> printf(_(" -n, --noclean do not clean up after errors\n"));
> printf(_(" -N, --nosync do not wait for changes to be written safely to disk\n"));
> printf(_(" -s, --show show internal settings\n"));
> + printf(_(" -S, --sync-only only sync data directory\n"));
> printf(_("\nOther options:\n"));
> printf(_(" -V, --version output version information, then exit\n"));
> printf(_(" -?, --help show this help, then exit\n"));
> *************** main(int argc, char *argv[])
> *** 3445,3450 ****
> --- 3447,3453 ----
> {"show", no_argument, NULL, 's'},
> {"noclean", no_argument, NULL, 'n'},
> {"nosync", no_argument, NULL, 'N'},
> + {"sync-only", no_argument, NULL, 'S'},
> {"xlogdir", required_argument, NULL, 'X'},
> {NULL, 0, NULL, 0}
> };
> *************** main(int argc, char *argv[])
> *** 3476,3482 ****
>
> /* process command-line options */
>
> ! while ((c = getopt_long(argc, argv, "dD:E:L:nNU:WA:sT:X:", long_options, &option_index)) != -1)
> {
> switch (c)
> {
> --- 3479,3485 ----
>
> /* process command-line options */
>
> ! while ((c = getopt_long(argc, argv, "dD:E:L:nNU:WA:sST:X:", long_options, &option_index)) != -1)
> {
> switch (c)
> {
> *************** main(int argc, char *argv[])
> *** 3522,3527 ****
> --- 3525,3533 ----
> case 'N':
> do_sync = false;
> break;
> + case 'S':
> + sync_only = true;
> + break;
> case 'L':
> share_path = pg_strdup(optarg);
> break;
> *************** main(int argc, char *argv[])
> *** 3589,3594 ****
> --- 3595,3608 ----
> exit(1);
> }
>
> + /* If we only need to fsync, just to it and exit */
> + if (sync_only)
> + {
> + setup_pgdata();
> + perform_fsync();
> + return 0;
> + }
> +
> if (pwprompt && pwfilename)
> {
> fprintf(stderr, _("%s: password prompt and password file cannot be specified together\n"), progname);

>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2012-12-04 04:24:08 Re: In pg_upgrade, remove 'set -x' from test script.
Previous Message Michael Glaesemann 2012-12-04 03:41:58 Re: Tablespaces in the data directory