Re: 500 tpsQL + WAL log implementation

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Curtis Faith" <curtis(at)galtair(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 500 tpsQL + WAL log implementation
Date: 2002-11-12 01:32:05
Message-ID: 19900.1037064725@sss.pgh.pa.us
Lists: pgsql-hackers

"Curtis Faith" <curtis(at)galtair(dot)com> writes:
> Using a raw file partition and a time-based technique for determining the
> optimal write position, I am able to get 8K writes physically written to disk
> synchronously in the range of 500 to 650 writes per second using FreeBSD raw
> device partitions on IDE disks (with write cache disabled).

What can you do *without* using a raw partition?

I dislike that idea for two reasons: portability and security. The
portability disadvantages are obvious. And in ordinary system setups
Postgres would have to run as root in order to write to a raw partition.

It occurs to me that the same technique could be used without any raw
device access. Preallocate a large WAL file and apply the method within
it. You'll have more noise in the measurements due to greater
variability in the physical positioning of the blocks, but it's
rather illusory to imagine that you know the disk geometry with any
accuracy anyway. Modern drives play a lot of games under the hood.
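A minimal sketch of the preallocation half of this suggestion, assuming a Unix system and Python's standard `os` module. It fills a file with real zeros so every block is allocated up front, then counts how many synced 8K writes per second complete when every write hits the same block versus when each write advances to a fresh block. The helper names, the file name and the 1024-block size are hypothetical, and Curtis's time-based position selection is not reproduced because the thread gives no details of it. As in the quoted setup, the numbers only mean something with the drive's write cache disabled and the file placed on the disk under test.

```python
import os
import tempfile
import time
from typing import Sequence

BLOCK_SIZE = 8192  # 8K writes, as in the quoted measurements

# fdatasync() avoids flushing non-essential metadata; fall back to fsync() where it is missing.
_sync = getattr(os, "fdatasync", os.fsync)


def preallocate(path: str, nblocks: int) -> int:
    """Create path with nblocks zero-filled blocks and return its size in bytes."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        zero = bytes(BLOCK_SIZE)
        # Writing actual zeros (not just ftruncate) makes the filesystem allocate
        # every block now, so later overwrites do not have to allocate new ones.
        for i in range(nblocks):
            os.pwrite(fd, zero, i * BLOCK_SIZE)
        os.fsync(fd)
    finally:
        os.close(fd)
    return nblocks * BLOCK_SIZE


def sync_write_rate(path: str, offsets: Sequence[int]) -> float:
    """Write one block at each offset, syncing after each write; return writes per second."""
    fd = os.open(path, os.O_WRONLY)
    buf = b"\xab" * BLOCK_SIZE
    try:
        start = time.perf_counter()
        for off in offsets:
            os.pwrite(fd, buf, off)
            _sync(fd)  # do not continue until the kernel reports the data flushed
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return len(offsets) / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # For a real measurement, put the file on the disk you care about, not a temp dir.
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "wal.prealloc")  # hypothetical file name
        preallocate(path, 1024)
        same = sync_write_rate(path, [0] * 100)
        fresh = sync_write_rate(path, [i * BLOCK_SIZE for i in range(100)])
        print(f"same block: {same:.0f} writes/s, advancing blocks: {fresh:.0f} writes/s")
```

Because even `fdatasync()` goes through a filesystem, expect more noise than on a raw device; that is the trade-off described above.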

> The obvious problem with the above mechanism is that the WAL log needs to be
> able to read from the log file in transaction order during recovery. This
> could be provided for using an abstraction that prepends the logical order
> for each block written to the disk and makes sure that the log blocks contain
> either a valid logical order number or some other marker indicating that the
> block is not being used.
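One way to picture the block abstraction described in the quote, as a sketch only: each 8K block begins with a small header holding an in-use marker, a logical sequence number and a payload length. An all-zero block counts as unused, and recovery scans the whole area and sorts the in-use blocks by sequence number. The header layout, the `MAGIC` value and the function names are hypothetical. Nothing here detects a partially written block, which is exactly the weakness raised in the reply that follows.

```python
import struct

BLOCK_SIZE = 8192
# Hypothetical header: in-use marker (uint32), logical sequence number (uint64), payload length (uint32).
HEADER = struct.Struct("<IQI")
MAGIC = 0x57414C31  # hypothetical "block in use" marker
MAX_PAYLOAD = BLOCK_SIZE - HEADER.size


def make_block(seq: int, payload: bytes) -> bytes:
    """Build one physical block that carries its own logical order number."""
    if seq <= 0:
        raise ValueError("sequence numbers start at 1; 0 is reserved for 'unused'")
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload too large for one block")
    return HEADER.pack(MAGIC, seq, len(payload)) + payload.ljust(MAX_PAYLOAD, b"\0")


def recover_order(image: bytes) -> list:
    """Scan every block and return the in-use payloads in logical (sequence) order."""
    found = []
    for off in range(0, len(image) - BLOCK_SIZE + 1, BLOCK_SIZE):
        magic, seq, length = HEADER.unpack_from(image, off)
        if magic != MAGIC or seq == 0 or length > MAX_PAYLOAD:
            continue  # never written, or not a header we recognize
        start = off + HEADER.size
        found.append((seq, image[start:start + length]))
    found.sort(key=lambda item: item[0])
    return [payload for _, payload in found]


if __name__ == "__main__":
    log = bytearray(4 * BLOCK_SIZE)  # all-zero blocks read as unused
    log[2 * BLOCK_SIZE:3 * BLOCK_SIZE] = make_block(1, b"first")
    log[0:BLOCK_SIZE] = make_block(2, b"second")
    print(recover_order(bytes(log)))  # [b'first', b'second']
```

Physical position no longer implies order, so recovery has to read the whole preallocated area rather than stopping at the first gap.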

This scares me quite a bit too. The reason that the existing
implementation maxes out at one WAL write per disk rotation is that for small
transactions it has to write the same disk sector over and over. You
could only get around that by writing multiple versions of the same WAL
page at different disk locations. Reliably reconstructing what data to
use is not something that I'm prepared to accept on a handwave...
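The one-write-per-rotation ceiling is simple arithmetic: a 7200 RPM drive (an example figure; the post does not name one) turns 7200 / 60 = 120 times per second, so if every commit must rewrite the same sector, it can do so at most about 120 times per second. The sketch below is an illustration under stated assumptions, not a proposal. It shows the bookkeeping that writing several versions of a page to different slots would need. Each slot carries a page number, a version counter and a CRC-32 over the header fields and payload. Recovery keeps the highest version whose checksum verifies, so a torn newest copy falls back to the previous one. The slot layout and names are hypothetical, and the sketch sidesteps the hard cases (counter wraparound, drives reordering writes) that make this worth worrying about.

```python
import struct
import zlib

SLOT_SIZE = 8192
CRC = struct.Struct("<I")    # CRC-32 stored first in each slot
META = struct.Struct("<IQI")  # hypothetical fields: page number, version counter, payload length
HEADER_SIZE = CRC.size + META.size
MAX_PAYLOAD = SLOT_SIZE - HEADER_SIZE


def rotations_per_second(rpm: float) -> float:
    """Upper bound on rewrites of a single sector per second: one per revolution."""
    return rpm / 60.0


def make_slot(page_no: int, version: int, payload: bytes) -> bytes:
    """Build one on-disk copy of a WAL page; version 0 is reserved for empty slots."""
    if version <= 0:
        raise ValueError("versions start at 1")
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload too large for one slot")
    meta = META.pack(page_no, version, len(payload))
    crc = zlib.crc32(meta + payload)  # covers the header fields too, not just the data
    return CRC.pack(crc) + meta + payload.ljust(MAX_PAYLOAD, b"\0")


def latest_pages(image: bytes) -> dict:
    """For each page number, return the payload of the newest copy whose CRC checks out."""
    best = {}
    for off in range(0, len(image) - SLOT_SIZE + 1, SLOT_SIZE):
        (crc,) = CRC.unpack_from(image, off)
        meta_start = off + CRC.size
        page_no, version, length = META.unpack_from(image, meta_start)
        if version == 0 or length > MAX_PAYLOAD:
            continue  # empty slot or unreadable header
        payload_start = meta_start + META.size
        if zlib.crc32(image[meta_start:payload_start + length]) != crc:
            continue  # torn or corrupted copy: ignore it and let an older one win
        if page_no not in best or version > best[page_no][0]:
            best[page_no] = (version, image[payload_start:payload_start + length])
    return {page: payload for page, (_, payload) in best.items()}


if __name__ == "__main__":
    print(rotations_per_second(7200))  # 120.0
    img = bytearray(3 * SLOT_SIZE)
    img[0:SLOT_SIZE] = make_slot(7, 1, b"old")
    img[SLOT_SIZE:2 * SLOT_SIZE] = make_slot(7, 2, b"new")
    print(latest_pages(bytes(img)))  # {7: b'new'}
    img[SLOT_SIZE + HEADER_SIZE] ^= 0xFF  # simulate a torn write of version 2
    print(latest_pages(bytes(img)))  # {7: b'old'}
```

Falling back to an older copy is only correct if the damaged write was never reported as committed; guaranteeing that under every failure mode is the part this sketch does not show.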

regards, tom lane
