Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Date: 2012-07-18 04:00:08
Message-ID: 500634C8.8030302@2ndQuadrant.com
Lists: pgsql-hackers pgsql-performance

On 07/17/2012 06:56 PM, Tom Lane wrote:
> So I went to fix this in the obvious way (attached), but while testing
> it I found that the number of buffers_backend events reported during
> a regression test run barely changed; which surprised the heck out of
> me, so I dug deeper. The cause turns out to be extremely scary:
> ForwardFsyncRequest isn't getting called at all in the bgwriter process,
> because the bgwriter process has a pendingOpsTable.

When I did my testing early this year to look at checkpointer
performance (among other 9.2 write changes like group commit), I did see
some cases where buffers_backend was dramatically different on 9.2 vs.
9.1. There were plenty of cases where the totals across a 10 minute
pgbench were almost identical though, so this issue didn't stick out
then. That's a very different workload than the regression tests though.
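
For anyone following along, here is a toy model of the failure mode
described in the quoted text. It is only a sketch under stated
assumptions, not PostgreSQL source: the classes Checkpointer and Writer,
the run() helper, and the segment names are all made up for
illustration. The only idea taken from the quote is that a process
holding its own pending-ops table queues fsync requests locally instead
of forwarding them, so if that process never syncs, those writes never
get fsynced.

```python
# Toy model only: the names echo the quoted text, but none of this is
# PostgreSQL source code. All classes and helpers here are hypothetical.

class Checkpointer:
    """The only process in this model that fsyncs at checkpoint time."""

    def __init__(self):
        self.pending = set()  # stands in for the checkpointer's pending-ops table
        self.synced = set()

    def forward_fsync_request(self, segment):
        # Stand-in for forwarding a request to the checkpointer.
        self.pending.add(segment)

    def checkpoint(self):
        # Sync everything that was forwarded, then reset the table.
        self.synced |= self.pending
        self.pending.clear()


class Writer:
    """A process that writes dirty buffers (a backend or the bgwriter)."""

    def __init__(self, checkpointer, has_local_table):
        self.checkpointer = checkpointer
        self.local_pending = set() if has_local_table else None

    def write(self, segment):
        if self.local_pending is not None:
            # Queued locally; nothing in this process ever fsyncs it.
            self.local_pending.add(segment)
        else:
            self.checkpointer.forward_fsync_request(segment)


def run(bgwriter_has_local_table):
    """Write one segment from each process, checkpoint, return what was synced."""
    ckpt = Checkpointer()
    backend = Writer(ckpt, has_local_table=False)
    bgwriter = Writer(ckpt, has_local_table=bgwriter_has_local_table)
    backend.write("seg_a")
    bgwriter.write("seg_b")
    ckpt.checkpoint()
    return ckpt.synced


print(sorted(run(bgwriter_has_local_table=True)))   # seg_b is never synced
print(sorted(run(bgwriter_has_local_table=False)))  # both segments are synced
```

In the first run the bgwriter's write is silently dropped from the sync
set, which is the kind of loss only pull-the-plug testing would expose.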

> This implies that nobody has done pull-the-plug testing on either HEAD
> or 9.2 since the checkpointer split went in (2011-11-01), because even
> a modicum of such testing would surely have shown that we're failing to
> fsync a significant fraction of our write traffic.

Ugh. Most of my pull-the-plug testing over the last six months has been
focused on SSD tests with older versions. I want to duplicate this (and
any potential fix) now that you've highlighted it.

> Furthermore, I would say that any performance testing done since then,
> if it wasn't looking at purely read-only scenarios, isn't worth the
> electrons it's written on. In particular, any performance gain that
> anybody might have attributed to the checkpointer splitup is very
> probably hogwash.

There hasn't been any performance testing that suggested the
checkpointer splitup was justified. The stuff I did showed it being
flat-out negative for a subset of pgbench-oriented cases, but those
didn't seem real-world enough to rule it out as the right thing to do.

I thought there were two valid justifications for the checkpointer split
(which is not a feature I have any corporate attachment to--I'm as
isolated from how it was developed as you are). The first is that it
seems like the right architecture to allow reworking checkpoints and
background writes for future write path optimization. A good chunk of
the time when I've tried to improve one of those (like my spread sync
stuff from last year), the code was complicated by the background writer
needing to march to the drum of checkpoint timing, and vice-versa. Being
able to hack on those independently got a sigh of relief from me. And
while this adds some code duplication in things like the process setup,
I thought the result would be cleaner for people reading the code to
follow too. This problem is terrible, but I think part of how it crept
in is that the single checkpoint+background writer process was doing way
too many things to even follow all of them some days.

The second justification for the split was that it seems easier to get a
low-power result from it, which I believe was the angle Peter Geoghegan was
working when this popped up originally. The checkpointer has to run
sometimes, but only at a 50% duty cycle as it's tuned out of the box.
It seems nice to be able to approach that in a way that's
power-efficient without coupling it to whatever heartbeat the BGW is running
at. I could even see people changing the frequencies for each
independently depending on expected system load. Tune for lower power
when you don't expect many users, that sort of thing.
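
A rough way to see where a 50% duty cycle comes from, assuming it
refers to the stock settings of checkpoint_timeout = 5min and
checkpoint_completion_target = 0.5, and assuming only timed checkpoints
(none triggered early by WAL volume). The helper name below is made up
for illustration, and the model ignores the sync phase at the end of
each checkpoint.

```python
def checkpoint_write_window(checkpoint_timeout_s, completion_target):
    """Return (busy_seconds, idle_seconds) per timed checkpoint interval.

    Hypothetical helper: models only the spread-write phase, which aims to
    finish within completion_target of the interval between checkpoints.
    """
    if not 0.0 <= completion_target <= 1.0:
        raise ValueError("completion_target must be between 0 and 1")
    busy = checkpoint_timeout_s * completion_target
    return busy, checkpoint_timeout_s - busy


# Assumed defaults: checkpoint_timeout = 300 s, completion target = 0.5.
busy, idle = checkpoint_write_window(300, 0.5)
print(f"writing for {busy:.0f}s, idle for {idle:.0f}s of every 300s")
```

Under those assumptions the checkpointer has about two and a half
minutes out of every five in which it could sleep without waking on the
background writer's schedule.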

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
