Re: [HACKERS] fsync method checking

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] fsync method checking
Date: 2004-03-18 17:46:13
Message-ID: 200403181746.i2IHkDA00975@candle.pha.pa.us
Lists: pgsql-hackers pgsql-performance


I have been poking around with our fsync default options to see if I can
improve them. One issue is that we never default to O_SYNC, but default
to O_DSYNC if it exists, which seems strange.
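
Whether O_DSYNC "exists" is a compile-time question, and the quoted code
further down also requires it to differ from O_SYNC. The following is a
minimal sketch of that check, not code from the backend; the function
name o_dsync_status is made up for illustration.

```c
#include <fcntl.h>

/*
 * Reports at compile time whether O_DSYNC can serve as a separate
 * open() flag on this platform. The function name is hypothetical.
 */
const char *o_dsync_status(void)
{
#if !defined(O_DSYNC)
    return "O_DSYNC is not defined";
#elif defined(O_SYNC) && (O_DSYNC == O_SYNC)
    return "O_DSYNC is defined but identical to O_SYNC";
#else
    return "O_DSYNC is defined and distinct from O_SYNC";
#endif
}
```

Only the last case would make open_datasync a separate choice from
open_sync.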

What I did was to beef up my test program and get it into CVS for folks
to run. What I found was that different operating systems have
different optimal defaults. On BSD/OS and FreeBSD, fdatasync/fsync was
better, but on Linux, O_DSYNC/O_SYNC was faster.

BSD/OS 4.3:
Simple write timing:
write 0.000055

Compare fsync before and after write's close:
write, fsync, close 0.000707
write, close, fsync 0.000808

Compare one o_sync write to two:
one 16k o_sync write 0.009762
two 8k o_sync writes 0.008799

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 0.000658
(fdatasync unavailable)
write, fsync, 0.000702

Compare file sync methods with 2 8k writes:
(The fastest should be used for wal_sync_method)
(o_dsync unavailable)
open o_sync, write 0.010402
(fdatasync unavailable)
write, fsync, 0.001025

This shows terrible O_SYNC performance for two 8k writes (0.010402 vs.
0.001025 for fsync), even though O_SYNC is slightly faster for a single
8k write (0.000658 vs. 0.000702). Strange.

FreeBSD 4.9:
Simple write timing:
write 0.000083

Compare fsync before and after write's close:
write, fsync, close 0.000412
write, close, fsync 0.000453

Compare one o_sync write to two:
one 16k o_sync write 0.000409
two 8k o_sync writes 0.000993

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 0.000683
(fdatasync unavailable)
write, fsync, 0.000405

Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
open o_sync, write 0.000789
(fdatasync unavailable)
write, fsync, 0.000414

This shows fsync to be fastest in both cases.

Linux 2.4.9:
Simple write timing:
write 0.000061

Compare fsync before and after write's close:
write, fsync, close 0.000398
write, close, fsync 0.000407

Compare one o_sync write to two:
one 16k o_sync write 0.000570
two 8k o_sync writes 0.000340

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 0.000166
write, fdatasync 0.000462
write, fsync, 0.000447

Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
open o_sync, write 0.000334
write, fdatasync 0.000445
write, fsync, 0.000447

This shows O_SYNC to be fastest, even for 2 8k writes.
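
For readers who want to see the shape of the "2 8k writes" comparison,
here is a minimal sketch. It is not the src/tools/fsync program; the
function name, the file path passed in, and the choice to time only the
writes and the sync (not the open) are assumptions made for illustration.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define BLOCK_SIZE 8192

/*
 * Times two 8k writes that are made durable either by opening the file
 * with O_SYNC (use_open_sync != 0) or by a single fsync() afterwards.
 * Returns elapsed seconds, or -1.0 on any I/O error.
 */
double time_two_writes(const char *path, int use_open_sync)
{
    char buf[BLOCK_SIZE];
    struct timeval start, stop;
    int flags = O_RDWR | O_CREAT | O_TRUNC;
    int fd, i, ok = 1;

    memset(buf, 'a', sizeof(buf));
    if (use_open_sync)
        flags |= O_SYNC;

    fd = open(path, flags, 0600);
    if (fd == -1)
        return -1.0;

    gettimeofday(&start, NULL);
    for (i = 0; i < 2; i++)
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
            ok = 0;
    /* With O_SYNC each write() is already synchronous; otherwise sync once. */
    if (!use_open_sync && fsync(fd) != 0)
        ok = 0;
    gettimeofday(&stop, NULL);

    close(fd);
    unlink(path);
    if (!ok)
        return -1.0;
    return (stop.tv_sec - start.tv_sec) +
           (stop.tv_usec - start.tv_usec) / 1e6;
}
```

Calling time_two_writes(path, 1) and time_two_writes(path, 0) on the
same filesystem gives the two numbers to compare. A single run is
noisy, so averaging over many iterations would be needed before
drawing conclusions.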

This unapplied patch:

ftp://candle.pha.pa.us/pub/postgresql/mypatches/fsync

adds DEFAULT_OPEN_SYNC to the bsdi/freebsd/linux template files, which
controls the default for those platforms. Platforms with no template
default to fdatasync/fsync.
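
As a rough illustration only (this is not the patch, whose contents are
not reproduced here), a template-supplied macro could steer the default
along these lines. The function name is hypothetical, and treating
DEFAULT_OPEN_SYNC as a simple defined/undefined switch is an assumption.

```c
#include <fcntl.h>

/*
 * Hypothetical sketch: assumes a platform template defines
 * DEFAULT_OPEN_SYNC when the open-flag methods won the timing test,
 * and that configure defines HAVE_FDATASYNC where fdatasync() exists.
 */
const char *sketch_default_sync_method(void)
{
#if defined(DEFAULT_OPEN_SYNC) && defined(O_DSYNC) && defined(O_SYNC) && (O_DSYNC != O_SYNC)
    return "open_datasync";
#elif defined(DEFAULT_OPEN_SYNC) && defined(O_SYNC)
    return "open_sync";
#elif defined(HAVE_FDATASYNC)
    return "fdatasync";
#else
    return "fsync";
#endif
}
```

Compiled without DEFAULT_OPEN_SYNC, the sketch falls through to
fdatasync or fsync, matching the behavior described for platforms with
no template setting.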

Would other users run src/tools/fsync and report their findings so I can
update the template files for their OSes? This is a process similar to
our thread testing.

Thanks.

---------------------------------------------------------------------------

Bruce Momjian wrote:
> Mark Kirkwood wrote:
> > This is a well-worn thread title - apologies, but these results seemed
> > interesting, and hopefully useful in the quest to get better performance
> > on Solaris:
> >
> > I was curious to see if the rather uninspiring pgbench performance
> > obtained from a Sun 280R (see General: ATA Disks and RAID controllers
> > for database servers) could be improved if more time was spent
> > tuning.
> >
> > With the help of a fellow workmate who is a bit of a Solaris guy, we
> > decided to have a go.
> >
> > The major performance killer appeared to be mounting the filesystem with
> > the logging option. The next most significant seemed to be the choice of
> > sync_method for Pg - the default (open_datasync), which we initially
> > thought should be the best - appears noticeably slower than fdatasync.
>
> I thought the default was fdatasync, but looking at the code it seems
> the default is open_datasync if O_DSYNC is available.
>
> I assume the logic is that we usually do only one write() before
> fsync(), so open_datasync should be faster. Why do we not use O_FSYNC
> over fsync()?
>
> Looking at the code:
>
> #if defined(O_SYNC)
> #define OPEN_SYNC_FLAG O_SYNC
> #else
> #if defined(O_FSYNC)
> #define OPEN_SYNC_FLAG O_FSYNC
> #endif
> #endif
>
> #if defined(OPEN_SYNC_FLAG)
> #if defined(O_DSYNC) && (O_DSYNC != OPEN_SYNC_FLAG)
> #define OPEN_DATASYNC_FLAG O_DSYNC
> #endif
> #endif
>
> #if defined(OPEN_DATASYNC_FLAG)
> #define DEFAULT_SYNC_METHOD_STR "open_datasync"
> #define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
> #define DEFAULT_SYNC_FLAGBIT OPEN_DATASYNC_FLAG
> #else
> #if defined(HAVE_FDATASYNC)
> #define DEFAULT_SYNC_METHOD_STR "fdatasync"
> #define DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC
> #define DEFAULT_SYNC_FLAGBIT 0
> #else
> #define DEFAULT_SYNC_METHOD_STR "fsync"
> #define DEFAULT_SYNC_METHOD SYNC_METHOD_FSYNC
> #define DEFAULT_SYNC_FLAGBIT 0
> #endif
> #endif
>
> I think the problem is that we prefer O_DSYNC over fdatasync, but do not
> prefer O_FSYNC over fsync.
>
> Running the attached test program shows on BSD/OS 4.3:
>
> write 0.000360
> write & fsync 0.001391
> write, close & fsync 0.001308
> open o_fsync, write 0.000924
>
> showing O_FSYNC faster than fsync().
>
> --
> Bruce Momjian | http://candle.pha.pa.us
> pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
> + If your life is a hard drive, | 13 Roberts Road
> + Christ can be your backup. | Newtown Square, Pennsylvania 19073

> /*
> * test_fsync.c
> * tests if fsync can be done from another process than the original write
> */
>
> #include <sys/types.h>
> #include <sys/time.h>	/* gettimeofday(), struct timeval */
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <time.h>
> #include <unistd.h>
>
> void die(char *str);
> void print_elapse(struct timeval start_t, struct timeval elapse_t);
>
> int main(int argc, char *argv[])
> {
> struct timeval start_t;
> struct timeval elapse_t;
> int tmpfile;
> char *strout = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
>
> /* write only */
> gettimeofday(&start_t, NULL);
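> /* O_CREAT requires a mode argument */
> if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
> die("can't open /var/tmp/test_fsync.out");
> /* write the buffer itself, not the address of the pointer */
> write(tmpfile, strout, 200);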
> close(tmpfile);
> gettimeofday(&elapse_t, NULL);
> unlink("/var/tmp/test_fsync.out");
> printf("write ");
> print_elapse(start_t, elapse_t);
> printf("\n");
>
> /* write & fsync */
> gettimeofday(&start_t, NULL);
> if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
> die("can't open /var/tmp/test_fsync.out");
> write(tmpfile, strout, 200);
> fsync(tmpfile);
> close(tmpfile);
> gettimeofday(&elapse_t, NULL);
> unlink("/var/tmp/test_fsync.out");
> printf("write & fsync ");
> print_elapse(start_t, elapse_t);
> printf("\n");
>
> /* write, close & fsync */
> gettimeofday(&start_t, NULL);
> if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
> die("can't open /var/tmp/test_fsync.out");
> write(tmpfile, strout, 200);
> close(tmpfile);
> /* reopen file */
> if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT, 0600)) == -1)
> die("can't open /var/tmp/test_fsync.out");
> fsync(tmpfile);
> close(tmpfile);
> gettimeofday(&elapse_t, NULL);
> unlink("/var/tmp/test_fsync.out");
> printf("write, close & fsync ");
> print_elapse(start_t, elapse_t);
> printf("\n");
>
> /* open_fsync, write */
> gettimeofday(&start_t, NULL);
> if ((tmpfile = open("/var/tmp/test_fsync.out", O_RDWR | O_CREAT | O_FSYNC, 0600)) == -1)
> die("can't open /var/tmp/test_fsync.out");
> write(tmpfile, strout, 200);
> close(tmpfile);
> gettimeofday(&elapse_t, NULL);
> unlink("/var/tmp/test_fsync.out");
> printf("open o_fsync, write ");
> print_elapse(start_t, elapse_t);
> printf("\n");
>
> return 0;
> }
>
> void print_elapse(struct timeval start_t, struct timeval elapse_t)
> {
> if (elapse_t.tv_usec < start_t.tv_usec)
> {
> elapse_t.tv_sec--;
> elapse_t.tv_usec += 1000000;
> }
>
> printf("%ld.%06ld", (long) (elapse_t.tv_sec - start_t.tv_sec),
> (long) (elapse_t.tv_usec - start_t.tv_usec));
> }
>
> void die(char *str)
> {
> fprintf(stderr, "%s\n", str);
> exit(1);
> }


--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
