Re: Load distributed checkpoint

From: "Takayuki Tsunakawa" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>
To: "Inaam Rana" <inaamrana(at)gmail(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Load distributed checkpoint
Date: 2006-12-21 00:52:38
Message-ID: 011c01c7249a$4fa27980$19527c0a@OPERAO
Lists: pgsql-hackers pgsql-patches

On 12/20/06, Takayuki Tsunakawa <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com> wrote:
> > [Conclusion]
> > I believe that the problem cannot be solved in a real sense by
> > avoiding fsync/fdatasync(). We can't ignore what commercial databases
> > have done so far. The kernel does as much work as it likes when
> > PostgreSQL asks it to fsync().

From: Inaam Rana
> I am new to the community and am very interested in the tests that you
> have done. I am also working on resolving the sudden IO spikes at checkpoint
> time. I agree with you that fsync() is the core issue here.

Thank you for understanding my poor English correctly. Yes, my point has been
that fsync()/fdatasync() must be avoided, and O_SYNC used instead (plus
O_DIRECT where the target platform supports it), to really eliminate the big
spikes.
In my mail, the following sentence contained a small mistake:

"I believe that the problem cannot be solved in a real sense by avoiding
fsync/fdatasync()."

The correct sentence is:

"I believe that the problem cannot be solved in a real sense without
avoiding fsync/fdatasync()."

> Being a new member I was wondering if someone on this list has done
> testing with O_DIRECT and/or O_SYNC for datafiles as that seems to be the
> most logical way of dealing with fsync() flood at checkpoint time. If so,
> I'll be very interested in the results.

Could you take a look at the mail I sent on Dec 18? Its content was so long
that I zipped it and attached it to the mail. I have just performed the same
test, simply adding O_SYNC to the open() calls in mdopen() and one other
function in md.c. I couldn't run the test with O_DIRECT, because O_DIRECT
requires the shared buffers to be aligned on a sector-size boundary. To
perform the O_DIRECT test, the code that allocates the shared buffers needs a
little more modification.
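
For readers who have not tried this, the change amounts to opening each data file with O_SYNC, so that every write() returns only after the data reaches stable storage and no fsync() is needed at checkpoint time. O_DIRECT additionally requires an aligned user buffer. The following is a minimal sketch in Python, whose os module exposes the same POSIX calls. open_datafile_sync(), aligned_block(), write_block() and demo() are hypothetical names, not PostgreSQL's mdopen() code. Using an anonymous mmap as the aligned buffer is an assumption that works because such mappings are page-aligned.

```python
import mmap
import os
import tempfile

BLCKSZ = 8192  # PostgreSQL's default block size


def open_datafile_sync(path, use_direct=False):
    """Hypothetical helper: open a data file so that each write() returns
    only after the data reaches stable storage (O_SYNC), optionally also
    bypassing the kernel page cache (O_DIRECT, where the platform has it)."""
    flags = os.O_RDWR | os.O_CREAT | os.O_SYNC
    if use_direct and hasattr(os, "O_DIRECT"):
        flags |= os.O_DIRECT
    return os.open(path, flags, 0o600)


def aligned_block(fill=b"\0"):
    """O_DIRECT needs an aligned user buffer. An anonymous mmap is
    page-aligned, which covers common 512-byte and 4 KB sector sizes."""
    buf = mmap.mmap(-1, BLCKSZ)
    buf.write(fill * BLCKSZ)
    return buf


def write_block(fd, blockno, buf):
    """Write one block in place; with O_SYNC no later fsync() is needed."""
    return os.pwrite(fd, buf, blockno * BLCKSZ)


def demo():
    """Try O_SYNC + O_DIRECT first, then fall back to O_SYNC alone if the
    filesystem (e.g. tmpfs on older kernels) rejects O_DIRECT."""
    path = os.path.join(tempfile.mkdtemp(), "datafile")
    buf = aligned_block(b"x")
    for direct in (True, False):
        try:
            fd = open_datafile_sync(path, use_direct=direct)
        except OSError:
            continue
        try:
            return path, write_block(fd, 2, buf)
        except OSError:
            continue
        finally:
            os.close(fd)
    raise RuntimeError("could not write block")
```

Because every write is synchronous here, the cost is paid on each write rather than in one burst at checkpoint. This is also why a single writer can fall behind, as discussed further down the thread.
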
The result of the O_SYNC test was bad, but that's just a starting point. We
need some of the improvements that commercial databases have made. Some
approaches I think we should take are:

(1) two-checkpoint (described in Jim Gray's textbook "Transaction
Processing: Concepts and Techniques")
(2) what Oracle suggests in its manual (see my previous mails)
(3) write multiple contiguous buffers with one write() to reduce the number
of write() calls
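
Idea (3) can be sketched as grouping the dirty blocks of one file into runs of consecutive block numbers and issuing one write() per run. This is a hypothetical Python illustration, not PostgreSQL code. The max_run cap on a single I/O is an assumption added so that one write() cannot grow without bound.

```python
import os

BLCKSZ = 8192  # PostgreSQL's default block size


def contiguous_runs(blocknos):
    """Group block numbers into runs of consecutive blocks, e.g.
    [20, 3, 9, 4, 10, 5] -> [[3, 4, 5], [9, 10], [20]]."""
    runs = []
    for b in sorted(blocknos):
        if runs and b == runs[-1][-1] + 1:
            runs[-1].append(b)
        else:
            runs.append([b])
    return runs


def flush_dirty_blocks(fd, dirty, max_run=16):
    """Write dirty blocks (dict: blockno -> BLCKSZ bytes) with one write per
    contiguous run of at most max_run blocks (hypothetical cap).
    Returns the number of coalesced writes issued."""
    nwrites = 0
    for run in contiguous_runs(dirty):
        for i in range(0, len(run), max_run):
            chunk = run[i:i + max_run]
            data = memoryview(b"".join(dirty[b] for b in chunk))
            offset = chunk[0] * BLCKSZ
            while data:  # retry in the rare case of a short write
                n = os.pwrite(fd, data, offset)
                data, offset = data[n:], offset + n
            nwrites += 1
    return nwrites
```

For the six dirty blocks in the docstring example, this issues three writes instead of six. The gain grows with the length of the runs.
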

> As mentioned in this thread that a single bgwriter with O_DIRECT will not
> be able to keep pace with cleaning effort causing backend writes. I think
> (i.e. IMHO) multiple bgwriters and/or AsyncIO with O_DIRECT can resolve this
> issue.

I agree with you. Oracle provides a parameter called DB_WRITER_PROCESSES to
set the number of database writer processes. Oracle also provides
asynchronous I/O to address the problem you describe. Please see section
10.3.9 of the following page:

http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14211/instance_tune.htm#sthref1049
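
Here is a rough sketch of the multiple-writer idea, in the spirit of Oracle's DB_WRITER_PROCESSES. It forks several writer processes and gives each one a share of the dirty blocks, so that their synchronous O_SYNC writes overlap instead of queuing behind a single bgwriter. The function names and the blockno % nwriters assignment policy are hypothetical. The sketch is Unix-only because it uses os.fork(), and it assumes each pwrite() of one block completes in full, as is normal for regular files.

```python
import os

BLCKSZ = 8192  # PostgreSQL's default block size


def partition(blocknos, nwriters):
    """Hypothetical policy: writer i handles blocks with blockno % nwriters == i."""
    shares = [[] for _ in range(nwriters)]
    for b in blocknos:
        shares[b % nwriters].append(b)
    return shares


def parallel_flush(path, dirty, nwriters=4):
    """Fork nwriters child processes, each writing its share of the dirty
    blocks (dict: blockno -> BLCKSZ bytes) to the existing file `path`
    with O_SYNC, so that their synchronous writes overlap.
    Returns the number of writers that failed."""
    pids = []
    for share in partition(sorted(dirty), nwriters):
        pid = os.fork()
        if pid == 0:  # child: one "writer process"
            status = 1
            try:
                fd = os.open(path, os.O_WRONLY | os.O_SYNC)
                for b in share:
                    os.pwrite(fd, dirty[b], b * BLCKSZ)
                os.close(fd)
                status = 0
            finally:
                os._exit(status)  # never return into the parent's code
        pids.append(pid)
    failures = 0
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        if not (os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0):
            failures += 1
    return failures
```

A round-robin split spreads the load evenly but breaks up contiguous runs. Assigning contiguous block ranges to each writer would preserve the write() coalescing idea above.
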

> Talking of bgwriter_* parameters I think we are missing a crucial internal
> counter i.e. number of dirty pages. How much work bgwriter has to do at each
> wakeup call should be a function of total buffers and currently dirty
> buffers. Relying on both these values instead of just one static NBuffers
> should allow bgwriter to adapt more quickly to workload changes and ensure
> that not much work is accumulated for checkpoint.

I agree with you in the sense that the current bgwriter pays too little
attention to the system load. I believe PostgreSQL should be gentler on OLTP
transactions and, as a result, on the many users of the system. I think the
rate of WAL accumulation should also be taken into account. Let's list the
problems and ideas.
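
Here is a minimal sketch of pacing based on both the dirty-buffer count and the rate of WAL accumulation. It first estimates the time left before the next checkpoint from both checkpoint_timeout and the current WAL rate (whichever runs out first). It then spreads the pages that are dirty now evenly over the bgwriter rounds that remain. The function and its pacing rule are hypothetical. Only the parameter names echo the PostgreSQL settings checkpoint_timeout, checkpoint_segments and bgwriter_delay, and the default values are illustrative.

```python
import math


def bgwriter_pages_this_round(ndirty, elapsed_s, wal_segs_used,
                              checkpoint_timeout_s=300,
                              checkpoint_segments=3,
                              bgwriter_delay_ms=200):
    """Hypothetical pacing rule: how many dirty pages the bgwriter should
    write in this wakeup so that the pages dirty now are spread evenly over
    the rounds left before the next checkpoint is due."""
    # Time left until checkpoint_timeout expires.
    remaining_by_time = checkpoint_timeout_s - elapsed_s
    # Time left until checkpoint_segments fills up, extrapolating the WAL
    # rate observed since the last checkpoint.
    wal_progress = wal_segs_used / checkpoint_segments
    if wal_progress > 0 and elapsed_s > 0:
        remaining_by_wal = elapsed_s * (1.0 - wal_progress) / wal_progress
    else:
        remaining_by_wal = math.inf
    # Whichever trigger fires first decides the deadline.
    remaining_s = max(0.0, min(remaining_by_time, remaining_by_wal))
    rounds_left = max(1, math.floor(remaining_s * 1000 / bgwriter_delay_ms))
    return math.ceil(ndirty / rounds_left)
```

Early in a quiet checkpoint cycle, this writes only a page or two per round. When WAL is generated fast enough to exhaust checkpoint_segments well before the timeout, the per-round quota rises accordingly. A cap derived from the total number of buffers could be layered on top and is omitted here.
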
