Re: Analysis of ganged WAL writes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Curtis Faith" <curtis(at)galtair(dot)com>
Cc: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Hannu Krosing" <hannu(at)tm(dot)ee>, "Pgsql-Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Analysis of ganged WAL writes
Date: 2002-10-07 20:27:06
Message-ID: 25840.1034022426@sss.pgh.pa.us
Lists: pgsql-hackers

"Curtis Faith" <curtis(at)galtair(dot)com> writes:
> Even the theoretical limit you mention of one transaction per revolution
> per committing process seems like a significant bottleneck.

Well, too bad. If you haven't gotten your commit record down to disk,
then *you have not committed*. This is not negotiable. (If you think
it is, then turn off fsync and quit worrying ;-))

An application that is willing to have multiple transactions in flight
at the same time can open up multiple backend connections to issue those
transactions, and thereby perhaps beat the theoretical limit. But for
serial transactions, there is not anything we can do to beat that limit.
(At least not with the log structure we have now. One could imagine
dropping a commit record into the nearest one of multiple buckets that
are carefully scattered around the disk. But exploiting that would take
near-perfect knowledge about disk head positioning; it's even harder to
solve than the problem we're considering now.)
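
To put numbers on that limit: a serial client can commit at most once
per platter revolution, so the ceiling is just the rotation rate. A
minimal sketch in Python; the function names and the list of drive
speeds are illustrative assumptions, not measurements.

```python
def rotation_time_ms(rpm: float) -> float:
    """Time for one platter revolution, in milliseconds."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    return 60_000.0 / rpm


def max_serial_commits_per_sec(rpm: float) -> float:
    """Upper bound for one serial client: one commit per revolution."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    return rpm / 60.0


# Illustrative drive speeds only.
for rpm in (5400, 7200, 10000, 15000):
    print(f"{rpm:>6} rpm: {rotation_time_ms(rpm):5.2f} ms/rev, "
          f"at most {max_serial_commits_per_sec(rpm):5.1f} serial commits/sec")
```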

> I still think that it would be much faster to just keep writing the WAL log
> blocks when they fill up and have a separate process wake the committing
> process when the write completes. This would eliminate WAL writing as a
> bottleneck.

You're failing to distinguish total throughput to the WAL drive from
response time seen by any one transaction. Yes, a policy of writing
each WAL block once when it fills would maximize potential throughput,
but it would also mean a potentially very large delay for a transaction
waiting to commit. The lower the system load, the longer a block takes
to fill, and so the worse the response time gets.
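
A rough way to see the load dependence: under a write-only-when-full
policy, a commit record sits in memory until the rest of its block has
been filled by other WAL traffic. The sketch below assumes an 8 kB WAL
block and made-up WAL generation rates; the function name is
hypothetical.

```python
def fill_wait_seconds(block_size: int, bytes_used: int,
                      wal_bytes_per_sec: float) -> float:
    """Time until the current WAL block fills, if it is written only when full.

    bytes_used is how much of the block is occupied once our commit
    record has been appended; wal_bytes_per_sec is the rate at which
    other activity keeps adding WAL data.
    """
    if not 0 <= bytes_used <= block_size:
        raise ValueError("bytes_used must be within the block")
    if wal_bytes_per_sec <= 0:
        return float("inf")  # idle system: the block never fills
    return (block_size - bytes_used) / wal_bytes_per_sec


BLOCK = 8192  # assumed block size in bytes
for rate in (1_000_000.0, 100_000.0, 1_000.0, 0.0):  # illustrative rates
    wait = fill_wait_seconds(BLOCK, 512, rate)
    print(f"WAL rate {rate:>11.0f} B/s -> commit waits {wait:.5f} s")
```

With a busy system the wait is a few milliseconds; with a nearly idle
one it grows to seconds or never ends, which is the response-time
problem described above.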

The scheme we now have (with my recent patch) essentially says that the
commit delay seen by any one transaction is at most two disk rotations.
Unfortunately it's also at least one rotation :-(, except in the case
where there is no contention, i.e., no already-scheduled WAL write when
the transaction reaches the commit stage. It would be nice to be able
to say "at most one disk rotation" instead --- but I don't see how to
do that in the absence of detailed information about disk head position.
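
The one-to-two-rotation bound can be illustrated with a deliberately
simplified model. It assumes that every write completes exactly when
the target WAL sector passes under the head, and that back-to-back
writes hit the same sector; both are assumptions of this sketch, and
the function name is hypothetical.

```python
def commit_delay_rotations(phase: float, write_in_flight: bool) -> float:
    """Commit delay, in disk rotations, under the simplified model.

    phase is the fraction of a rotation (0 <= phase < 1) that has
    elapsed since the target WAL sector last passed under the head
    at the moment the transaction reaches commit.
    """
    if not 0.0 <= phase < 1.0:
        raise ValueError("phase must be in [0, 1)")
    wait_for_sector = 1.0 - phase  # time until the sector comes around again
    if write_in_flight:
        # The already-scheduled write finishes when the sector arrives; our
        # write is issued just after that and needs one more full rotation.
        return wait_for_sector + 1.0
    # No contention: our write is issued at once and lands on the next pass.
    return wait_for_sector


for phase in (0.0, 0.25, 0.5, 0.9):
    print(f"phase {phase:.2f}: contended {commit_delay_rotations(phase, True):.2f}, "
          f"uncontended {commit_delay_rotations(phase, False):.2f} rotations")
```

In this model the contended delay always falls between one and two
rotations, and the uncontended delay is at most one.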

Something I was toying with this afternoon: assume we have a background
process responsible for all WAL writes --- not only filled buffers, but
the currently active buffer. It periodically checks to see if there
are unwritten commit records in the active buffer, and if so schedules
a write for them. If this could be done during each disk rotation,
"just before" the disk reaches the active WAL log block, we'd have an
ideal solution. And it would not be too hard for such a process to
determine the right time: it could measure the drive rotational speed
by observing the completion times of successive writes to the same
sector, and it wouldn't take much logic to empirically find the latest
time at which a write can be issued and have a good probability of
hitting the disk on time. (At least, this would work pretty well given
a dedicated WAL drive, else there'd be too much interference from other
I/O requests.)
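
The timing logic for such a writer could be sketched roughly as
follows. This is only an illustration: the function names are
hypothetical, the completion timestamps are invented, and the "lead"
margin is passed in as a parameter here rather than found empirically.

```python
import math
from statistics import median


def estimate_rotation_period(completion_times):
    """Estimate the rotation period from completion times of successive
    writes to the same sector. The median gap is used so that an
    occasional write that missed a revolution does not skew the result."""
    gaps = [b - a for a, b in zip(completion_times, completion_times[1:])]
    if not gaps:
        raise ValueError("need at least two completion times")
    return median(gaps)


def next_issue_time(now, last_completion, period, lead):
    """Earliest time >= now that is exactly `lead` seconds before a
    predicted pass of the sector under the head. Passes are predicted
    at last_completion + k * period for k = 1, 2, ..."""
    k = max(1, math.ceil((now + lead - last_completion) / period))
    return last_completion + k * period - lead


# Invented timestamps (seconds); one write missed a revolution.
times = [0.000, 0.008, 0.016, 0.032, 0.040]
period = estimate_rotation_period(times)
print(f"estimated period: {period * 1000:.2f} ms")
print(f"issue next write at t = {next_issue_time(0.045, times[-1], period, 0.001):.4f} s")
```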

However, this whole scheme falls down on the same problem we've run into
before: user processes can't schedule themselves with millisecond
accuracy. The writer process might be able to determine the ideal time
to wake up and make the check, but it can't get the Unix kernel to
dispatch it then, at least not on most Unixen. The typical scheduling
slop is one time slice, which is comparable to if not more than the
disk rotation time.
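
The slop is easy to observe from user space. The sketch below asks for
a short sleep and reports how late the process actually woke up; the
function name, the 1 ms request, and the trial count are arbitrary
choices, and the results depend entirely on the kernel and hardware.

```python
import time


def sleep_overshoot_ms(requested_ms: float, trials: int = 20):
    """Return how far past the requested wakeup time each sleep ran (ms)."""
    overshoots = []
    for _ in range(trials):
        start = time.perf_counter()
        time.sleep(requested_ms / 1000.0)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        overshoots.append(elapsed_ms - requested_ms)
    return overshoots


samples = sleep_overshoot_ms(1.0)
# For comparison: one revolution at 7200 rpm takes about 8.3 ms.
print(f"overshoot min {min(samples):.3f} ms, max {max(samples):.3f} ms")
```

If the worst-case overshoot is a sizable fraction of one revolution,
the writer cannot reliably hit the window it is aiming for.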

ISTM aio_write only improves the picture if there's some magic in-kernel
processing that makes this same kind of judgment as to when to issue the
"ganged" write for real, and is able to do it on time because it's in
the kernel. I haven't heard anything to make me think that that feature
actually exists. AFAIK the kernel isn't much more enlightened about
physical head positions than we are.

regards, tom lane
