Re: Analysis of ganged WAL writes

From: Hannu Krosing <hannu(at)tm(dot)ee>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Curtis Faith <curtis(at)galtair(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Pgsql-Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Analysis of ganged WAL writes
Date: 2002-10-07 19:00:13
Message-ID: 1034017213.2562.45.camel@rh72.home.ee
Lists: pgsql-hackers

On Tue, 2002-10-08 at 01:27, Tom Lane wrote:
>
> The scheme we now have (with my recent patch) essentially says that the
> commit delay seen by any one transaction is at most two disk rotations.
> Unfortunately it's also at least one rotation :-(, except in the case
> where there is no contention, ie, no already-scheduled WAL write when
> the transaction reaches the commit stage. It would be nice to be able
> to say "at most one disk rotation" instead --- but I don't see how to
> do that in the absence of detailed information about disk head position.
>
> Something I was toying with this afternoon: assume we have a background
> process responsible for all WAL writes --- not only filled buffers, but
> the currently active buffer. It periodically checks to see if there
> are unwritten commit records in the active buffer, and if so schedules
> a write for them. If this could be done during each disk rotation,
> "just before" the disk reaches the active WAL log block, we'd have an
> ideal solution. And it would not be too hard for such a process to
> determine the right time: it could measure the drive rotational speed
> by observing the completion times of successive writes to the same
> sector, and it wouldn't take much logic to empirically find the latest
> time at which a write can be issued and have a good probability of
> hitting the disk on time. (At least, this would work pretty well given
> a dedicated WAL drive, else there'd be too much interference from other
> I/O requests.)
>
> However, this whole scheme falls down on the same problem we've run into
> before: user processes can't schedule themselves with millisecond
> accuracy. The writer process might be able to determine the ideal time
> to wake up and make the check, but it can't get the Unix kernel to
> dispatch it then, at least not on most Unixen. The typical scheduling
> slop is one time slice, which is comparable to if not more than the
> disk rotation time.

The standard Linux time slice has been 100 Hz, but it has been
configurable for some time.

The latest Red Hat (8.0) is built with a 500 Hz timer, which gives about
4 slices/rev for 7200 rpm disks (2 for 15000 rpm).
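
To make the arithmetic explicit, here is a minimal sketch. The function
name is made up for illustration, and it assumes one time slice equals one
timer tick (1/HZ seconds):

```python
def slices_per_revolution(timer_hz: float, disk_rpm: float) -> float:
    """Number of scheduler time slices that fit into one disk revolution.

    Assumes one time slice lasts exactly 1/timer_hz seconds.
    """
    revolutions_per_second = disk_rpm / 60.0
    return timer_hz / revolutions_per_second


if __name__ == "__main__":
    for hz in (100, 500):
        for rpm in (7200, 15000):
            ratio = slices_per_revolution(hz, rpm)
            print(f"{hz} Hz, {rpm} rpm: {ratio:.2f} slices/rev")
```

At 100 Hz the result is below one slice per revolution for both disk
speeds, which matches the "comparable to if not more than the disk
rotation time" remark quoted above.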

> ISTM aio_write only improves the picture if there's some magic in-kernel
> processing that makes this same kind of judgment as to when to issue the
> "ganged" write for real, and is able to do it on time because it's in
> the kernel. I haven't heard anything to make me think that that feature
> actually exists. AFAIK the kernel isn't much more enlightened about
> physical head positions than we are.

At least for open source kernels it could be possible to

1. write a patch to kernel

or

2. get the authors of kernel aio interested in doing it.

or

3. the third possibility would be using some real-time (RT) OS or mixed
RT/conventional OS where some threads can be scheduled for hard-RT.
In an RT OS you are supposed to be able to do exactly what you describe.
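
For reference, the timing logic described in the quoted text (measure the
rotation period from completion times of successive writes to the same
sector, then issue the write just before the block comes around) might look
roughly like the sketch below. All names are hypothetical, times are in
microseconds, and lead_time stands for the empirically found margin; it
assumes the process really gets dispatched at the time it asks for, which
is exactly the part that needs RT scheduling.

```python
import math
from statistics import median


def estimate_rotation_period(completion_times):
    """Estimate the rotation period as the median gap between completion
    times of successive writes to the same sector."""
    gaps = [b - a for a, b in zip(completion_times, completion_times[1:])]
    if not gaps:
        raise ValueError("need at least two completion times")
    return median(gaps)


def next_issue_time(now, last_completion, period, lead_time):
    """Return the earliest time >= now at which a write should be issued
    so that it is lead_time ahead of a pass of the target sector.

    The sector is assumed to pass under the head at
    last_completion + k * period for k = 1, 2, ...
    """
    # Smallest k whose issue deadline has not already gone by.
    k = max(1, math.ceil((now + lead_time - last_completion) / period))
    return last_completion + k * period - lead_time


if __name__ == "__main__":
    # Made-up completion times for a 7200 rpm disk (about 8333 us per rev).
    times = [0, 8333, 16667, 25000]
    period = estimate_rotation_period(times)
    print(period, next_issue_time(26000, times[-1], period, lead_time=1000))
```

The median is used so that an occasional write that misses its rotation
(and shows up as a double-length gap) does not skew the estimate.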

I think that 2 and 3 could be "outsourced" (the respective developers
talked into supporting it), as both KAIO and RT Linuxen/BSDs are probably
also interested in high-profile applications, so they could boast that
"using our stuff enabled the PostgreSQL database to run twice as fast".

Anyway, getting to near-hardware speeds for a database will need more
specific support from the OS than web browsing or compiling does.

---------------
Hannu
