Re: Controlling Load Distributed Checkpoints

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-07 20:49:17
Message-ID: Pine.GSO.4.64.0706071602360.4005@westnet.com
Lists: pgsql-hackers pgsql-patches

On Thu, 7 Jun 2007, Gregory Stark wrote:

> You seem to have imagined that letting the checkpoint take longer will slow
> down transactions.

And you seem to have imagined that I have so much spare time that I'm just
making stuff up to entertain myself and sow confusion.

I observed some situations where delaying checkpoints too long ends up
slowing down both transaction rate and response time, using earlier
variants of the LDC patch and code with similar principles I wrote. I'm
trying to keep the approach used here out of the worst of the corner cases
I ran into, or at least to make it possible for people in those situations to
have some ability to tune out of the bad spots. I am unfortunately not
free to disclose all those test results, and since that project is over I
can't see how the current LDC compares to what I tested at the time.

I plainly stated I had a bias here, one that's not even close to the
average case. My concern here was that Heikki would end up optimizing in
a direction where a really wide spread across the active checkpoint
interval was strongly preferred. I wanted to offer some suggestions on
the type of situation where that might not be true, but where a different
tuning of LDC would still be an improvement over the current behavior.
There are some tuning knobs there that I don't want to see go away until
there's been a wider range of tests to prove they aren't effective.

> Right now we're seeing tests where Postgres stops handling *any* transactions
> for up to a minute. In virtually any real world scenario that would simply be
> unacceptable.

No doubt; I've seen things get close to that bad myself, both on the high
and low end. I collided with the issue in a situation of "maxing out your
i/o bandwidth, couldn't buy a faster controller" at one point, which is
what kicked off my working in this area. It turned out there were still
some software tunables left that pulled the worst case down to the 2-5
second range instead. With more checkpoint_segments to decrease the
frequency, that was just enough to make the problem annoying rather than
crippling. But after that, I could easily imagine a different application
scenario where the behavior you describe is the best case.

This is really a serious issue with the current design of the database,
one that merely changes instead of going away completely if you throw more
hardware at it. I'm perversely glad to hear this is torturing more people
than just me as it improves the odds the situation will improve.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD
