Potential Large Performance Gain in WAL synching

From: "Curtis Faith" <curtis(at)galtair(dot)com>
To: "Pgsql-Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Potential Large Performance Gain in WAL synching
Date: 2002-10-03 22:26:02
Message-ID: DMEEJMCDOJAKPPFACMPMCEBOCEAA.curtis@galtair.com
Lists: pgsql-hackers

I've been looking at the TODO lists and caching issues and think there may
be a way to greatly improve the performance of the WAL.

I've made the following assumptions based on my reading in the manual and
the WAL archives since about November 2000:

1) WAL is currently fsync'd before commit succeeds. This is done to ensure
that the D in ACID is satisfied.
2) The wait on fsync is the biggest time cost for inserts or updates.
3) fsync itself probably increases contention for file i/o on the same file
since some OS file system cache structures must be locked as part of fsync.
Depending on the file system this could be a significant choke on total i/o
throughput.

The issue is that there must be a definite record in durable storage for the
log before one can be certain that a transaction has succeeded.

I'm not familiar with the exact WAL implementation in PostgreSQL, though I
am familiar with others, including ARIES II. It seems, however, that it
comes down to making sure that the write to the WAL log has been positively
written to disk.

So, why don't we use files opened with O_DSYNC | O_APPEND for the WAL log
and then use aio_write for all log writes? A transaction would simply do all
of its log writing using aio_write and then block, using aio_waitcomplete,
until its last log aio request has completed. The call to aio_waitcomplete
won't return until the log record has been written to the disk. Opening with
O_DSYNC ensures that when i/o completes the write has been written to the
disk, and aio_write with O_APPEND opened files ensures that writes append in
the order they are received. Hence, when the aio_write for the last log entry
for a transaction completes, the transaction can be sure that its log
records are in durable storage (IDE problems aside).

It seems to me that this would:

1) Preserve the required D semantics.
2) Allow transactions to complete and do work while other threads are
waiting on the completion of the log write.
3) Obviate the need for commit_delay, since there is no blocking and the
file system and the disk controller can put multiple writes to the log
together as the drive is waiting for the end of the log file to come under
one of the heads.

Here are the relevant TODOs:

Delay fsync() when other backends are about to commit too [fsync]
Determine optimal commit_delay value

Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options
Allow multiple blocks to be written to WAL with one write()

Am I missing something?

Curtis Faith
Principal
Galt Capital, LLP

------------------------------------------------------------------
Galt Capital http://www.galtcapital.com
12 Wimmelskafts Gade
Post Office Box 7549 voice: 340.776.0144
Charlotte Amalie, St. Thomas fax: 340.776.0244
United States Virgin Islands 00801 cell: 340.643.5368
