Re: Checkpoint sync pause

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Checkpoint sync pause
Date: 2012-02-26 03:17:17
Message-ID: CAMkU=1wSTJYRdN9VQiTSE1rD8TSgt0JHWAEfezn-dmm4_g6TLA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Feb 12, 2012 at 10:49 PM, Amit Kapila <amit(dot)kapila(at)huawei(dot)com> wrote:
>>> Without sorted checkpoints (or some other fancier method) you have to
>>> write out the entire pool before you can do any fsyncs.  Or you have
>>> to do multiple fsyncs of the same file, with at least one occurring
>>> after the entire pool was written.  With a sorted checkpoint, you can
>>> start issuing once-only fsyncs very early in the checkpoint process.
>>> I think that on large servers, that would be the main benefit, not the
>>> actually more efficient IO.  (On small servers I've seen sorted
>>> checkpoints be much faster on shutdown checkpoints, but not on natural
>>> checkpoints, and presumably this improvement *is* due to better
>>> ordering).
>
>>> On your servers, you need big delays between fsyncs and not between
>>> writes (as they are buffered until the fsync).  But in other
>>> situations, people need the delays between the writes.  By using
>>> sorted checkpoints with fsyncs between each file, the delays between
>>> writes are naturally delays between fsyncs as well.  So I think the
>>> benefit of using sorted checkpoints is that code to improve your
>>> situations is less likely to degrade someone else's situation, without
>>> having to introduce an extra layer of tunables.
>
> What I understood is that you are suggesting it is better to do sorted
> checkpoints, which essentially means flushing nearby buffers together.

More importantly, you can issue an fsync after all the pages for any given
file are written, thus naturally spreading out the fsyncs instead of
reserving them until the end, or until some arbitrary fraction of the
checkpoint cycle has passed. For this purpose, the buffers only need to be
sorted by the physical file they are in, not by block order within the
file.
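
To make that concrete, here is a rough, self-contained sketch of the idea
in plain C. The names (DirtyBuf, checkpoint_write_sorted) are made up for
illustration; this is not the actual checkpointer/bufmgr code, just the
shape of it: sort the dirty buffers by file, write each file's pages, and
fsync a file as soon as its last page has gone out, so the pause between
files is also a pause between fsyncs.

#include <stdlib.h>     /* qsort */
#include <unistd.h>     /* pwrite, fsync, usleep */

typedef struct
{
    int     fd;         /* file the buffer belongs to */
    long    blockno;    /* block within that file */
    char   *page;       /* 8kB page image */
} DirtyBuf;

static int
cmp_by_file(const void *a, const void *b)
{
    const DirtyBuf *x = a;
    const DirtyBuf *y = b;

    return (x->fd > y->fd) - (x->fd < y->fd);
}

static void
checkpoint_write_sorted(DirtyBuf *bufs, int n, useconds_t delay)
{
    int     i;

    /* Only the file matters for spreading fsyncs; block order within
     * the file is optional. */
    qsort(bufs, n, sizeof(DirtyBuf), cmp_by_file);

    for (i = 0; i < n; i++)
    {
        (void) pwrite(bufs[i].fd, bufs[i].page, 8192,
                      (off_t) bufs[i].blockno * 8192);

        /* Last buffer of this file: fsync it now, once, then pause,
         * so the delay between writes is also a delay between fsyncs. */
        if (i == n - 1 || bufs[i + 1].fd != bufs[i].fd)
        {
            (void) fsync(bufs[i].fd);
            usleep(delay);
        }
    }
}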

> However, if done this way, it might violate an Oracle patent
> (20050044311 - Reducing disk IO by full-cache write-merging). I am not very
> sure about it, but you can refer to it.

Thank you. I was not aware of it, and am constantly astonished at what
kinds of things are patentable.

>>> I think the linked list is a bit of a red herring.  Many of the
>>> concepts people discuss implementing on the linked list could just as
>>> easily be implemented with the clock sweep.  And I've seen no evidence
>>> at all that the clock sweep is the problem.  The LWLock that protects it
>>> can obviously be a problem, but that seems to be due to the overhead
>>> of acquiring a contended lock, not the work done under the lock.
>>> Reducing the lock-strength around this might be a good idea, but that
>>> reduction could be done just as easily (and as far as I can tell, more
>>> easily) with the clock sweep than the linked list.
>
> with the clock sweep, there is a good chance that the backend needs to
> traverse more buffers to find a suitable one.

Maybe, but I have not seen any evidence that this is the case. My
analyses, experiments, and simulations show that when the buffer
allocation rate is high, the mere act of running the sweep that often
keeps the average usage count low, so the average sweep is very short.
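
The effect is easy to reproduce with a toy model. The sketch below is a
hypothetical simulation, not the real freelist.c clock sweep: every
allocation advances a hand that decrements usage counts until it finds a
zero, and when allocations are frequent relative to buffer hits, the
counts stay low and each sweep only has to visit a couple of buffers.

#include <stdio.h>
#include <stdlib.h>

#define NBUFFERS   1024
#define MAX_USAGE  5

static int usage[NBUFFERS];
static int hand;

/* One allocation: advance the hand, decrementing usage counts, until a
 * buffer with a zero count is found.  Returns how many buffers were
 * visited. */
static int
clock_sweep(void)
{
    int     visited = 0;

    for (;;)
    {
        visited++;
        if (usage[hand] == 0)
        {
            usage[hand] = 1;        /* newly loaded page starts at 1 */
            hand = (hand + 1) % NBUFFERS;
            return visited;
        }
        usage[hand]--;
        hand = (hand + 1) % NBUFFERS;
    }
}

int
main(void)
{
    long    total = 0;
    int     i;

    for (i = 0; i < 100000; i++)
    {
        int     hits = rand() % 4;  /* a few buffer hits per allocation */
        int     j;

        for (j = 0; j < hits; j++)
        {
            int     b = rand() % NBUFFERS;

            if (usage[b] < MAX_USAGE)
                usage[b]++;
        }
        total += clock_sweep();
    }
    printf("average buffers visited per allocation: %.2f\n",
           (double) total / 100000);
    return 0;
}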

> However, if a clean buffer is put in the freelist, it can be directly picked
> from there.

Not directly; you still have to take a lock.
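
A minimal sketch of what I mean, with made-up names rather than the real
StrategyGetBuffer() code: even when a clean buffer is sitting on the
freelist, the backend first has to acquire the lock that protects the
list, and it is the contention on that lock, not the work done under it,
that seems to hurt.

#include <pthread.h>
#include <stddef.h>

typedef struct FreeBuf
{
    struct FreeBuf *next;
} FreeBuf;

static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;
static FreeBuf *freelist_head;

static FreeBuf *
get_free_buffer(void)
{
    FreeBuf    *buf;

    pthread_mutex_lock(&freelist_lock);     /* the unavoidable lock */
    buf = freelist_head;
    if (buf != NULL)
        freelist_head = buf->next;
    pthread_mutex_unlock(&freelist_lock);

    return buf;                 /* NULL means fall back to the sweep */
}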

Cheers,

Jeff
