Re: WAL fsync scheduling

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Vadim Mikheev <vmikheev(at)sectorbase(dot)com>, Tom Samplonius <tom(at)sdf(dot)com>, Alfred Perlstein <bright(at)wintelcom(dot)net>, Larry Rosenman <ler(at)lerctr(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL fsync scheduling
Date: 2001-01-24 14:24:48
Message-ID: 200101241424.JAA15599@candle.pha.pa.us
Lists: pgsql-hackers


Added to TODO.detail and TODO list.

> > > There are two parts to transaction commit. The first is writing all
> > > dirty buffers or log changes to the kernel, and second is fsync of the
> > ^^^^^^^^^^^^
> > Backend doesn't write any dirty buffer to the kernel at commit time.
>
> Yes, I suspected that.
>
> >
> > > log file.
> >
> > The first part is writing commit record into WAL buffers in shmem.
> > This is what XLogInsert does. After that XLogFlush is called to ensure
> > that entire commit record is on disk. XLogFlush does *both* write() and
> > fsync() (single slock is used for both writing and fsyncing) if it needs to
> > do it at all.
>
> Yes, I realize there are new steps in WAL.
>
> >
> > > I suggest having a per-backend shared memory byte that has the following
> > > values:
> > >
> > > START_LOG_WRITE
> > > WAIT_ON_FSYNC
> > > NOT_IN_COMMIT
> > > backend_number_doing_fsync
> > >
> > > I suggest that when each backend starts a commit, it sets its byte to
> > > START_LOG_WRITE.
> > ^^^^^^^^^^^^^^^^^^^^^^^
> > Isn't START_COMMIT more meaningful?
>
> Yes.
>
> >
> > > When it gets ready to fsync, it checks all backends.
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^
> > What do you mean by this? The moment just after XLogInsert?
>
> Just before it calls fsync().
>
> >
> > > If all are NOT_IN_COMMIT, it does fsync and continues.
> >
> > 1st edition:
> > > If one or more are in START_LOG_WRITE, it waits until no one is in
> > > START_LOG_WRITE. It then checks all WAIT_ON_FSYNC, and if it is the
> > > lowest backend in WAIT_ON_FSYNC, marks all others with its backend
> > > number, and does fsync. It then clears all backends with its number to
> > > NOT_IN_COMMIT. Other backends will see they are not the lowest
> > > WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT
> > > so they can then continue, knowing their data was synced.
> >
> > 2nd edition:
> > > I have another idea. If a backend gets to the point that it needs
> > > fsync, and there is another backend in START_LOG_WRITE, it can go to an
> > > interruptible sleep, knowing another backend will perform the fsync and
> > > wake it up. Therefore, there is no busy-wait or timed sleep.
> > >
> > > Of course, a backend must set its status to WAIT_ON_FSYNC to avoid a
> > > race condition.
> >
> > The 2nd edition is much better. But I'm not sure we really need
> > these per-backend bytes in shmem. Why not just have some counters?
> > We can use a semaphore to wake up all waiters at once.
>
> Yes, that is much better and clearer. My idea was just to say, "if no
> one is entering commit phase, do the commit. If someone else is coming,
> sleep and wait for them to do the fsync and wake me up with a signal."
>
> >
> > > This allows a single backend not to sleep, and allows multiple backends
> > > to bunch up only when they are all about to commit.
> > >
> > > The reason backend numbers are written is so other backends entering the
> > > commit code will not interfere with the backends performing fsync.
> >
> > A woken-up backend can check what's written/fsynced by calling XLogFlush.
>
> Seems that may not be needed anymore with a counter. The only issue is
> that other backends may enter commit while fsync() is happening. The
> process that did the fsync must be sure to wake up only the backends
> that were waiting for it, and not other backends that may also be
> doing fsync as a group while the first fsync was happening. I leave
> those details to people more experienced. :-)
>
> I am just glad people liked my idea.
>
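
For the TODO.detail entry, here is a rough sketch of the counter-based variant discussed above. It is a toy Python simulation, not backend code: the class GroupCommitLog, its methods, and run_backends are names made up for illustration, threads stand in for backends, a condition variable stands in for the semaphore, and sync_fn stands in for write() plus fsync(). The point it tries to show is that a woken backend simply rechecks how far the log has been flushed, so a backend that entered commit while a sync was running is not released early.

```python
import threading
import time


class GroupCommitLog:
    """Toy model of a log whose flush is shared by concurrent committers."""

    def __init__(self, sync_fn=None):
        self._cond = threading.Condition()
        self._inserted = 0           # position of the last record in the buffer
        self._flushed = 0            # position known to be "on disk"
        self._sync_running = False   # True while one thread is syncing
        self._sync_fn = sync_fn or (lambda: time.sleep(0.005))
        self.flush_calls = 0         # number of simulated fsyncs performed

    def insert(self):
        """Append a commit record and return its position."""
        with self._cond:
            self._inserted += 1
            return self._inserted

    def flushed_upto(self):
        with self._cond:
            return self._flushed

    def flush(self, upto):
        """Return once every record up to `upto` has been flushed."""
        while True:
            with self._cond:
                # Someone else is syncing: sleep until woken, then recheck.
                while self._flushed < upto and self._sync_running:
                    self._cond.wait()
                if self._flushed >= upto:
                    return           # another thread's sync covered us
                # Nobody is syncing, so this thread does it for everyone
                # whose record is already in the buffer.
                self._sync_running = True
                target = self._inserted
            self._sync_fn()          # slow part, done without holding the lock
            with self._cond:
                self._flushed = target
                self._sync_running = False
                self.flush_calls += 1
                # Wake all waiters; those beyond `target` go around again.
                self._cond.notify_all()


def run_backends(n, sync_fn=None):
    """Start n threads that each insert one record and wait for its flush."""
    log = GroupCommitLog(sync_fn)

    def backend():
        log.flush(log.insert())

    threads = [threading.Thread(target=backend) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

A lone committer never sleeps here: it finds no sync in progress and does the flush itself. Committers bunch up only when they arrive while a sync is already running, which is the behavior the thread was after. Whether a real implementation should use a position counter like this or something else is left open.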

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
