Bgwriter LRU cleaning: we've been going at this all wrong

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Bgwriter LRU cleaning: we've been going at this all wrong
Date: 2007-06-26 20:24:55
Message-ID: 28084.1182889495@sss.pgh.pa.us
Lists: pgsql-hackers

I just had an epiphany, I think.

As I wrote in the LDC discussion,
http://archives.postgresql.org/pgsql-patches/2007-06/msg00294.php
if the bgwriter's LRU-cleaning scan has advanced ahead of freelist.c's
clock sweep pointer, then any buffers between them are either clean,
or are pinned and/or have usage_count > 0 (in which case the bgwriter
wouldn't bother to clean them, and freelist.c wouldn't consider them
candidates for re-use). And *this invariant is not destroyed by the
activities of other backends*. A backend cannot dirty a page without
raising its usage_count from zero, and there are no race conditions because
the transition states will be pinned.
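
The invariant can be written down as a small check. This is only a sketch
under stated assumptions: InvBuffer is a hypothetical, simplified stand-in
for the real buffer header, and inv_needs_cleaning and inv_region_holds are
illustrative names, not functions in bufmgr.c or freelist.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified buffer state (not PostgreSQL's BufferDesc). */
typedef struct
{
    bool dirty;
    bool pinned;
    int  usage_count;
} InvBuffer;

/* A buffer matters to the bgwriter only if it is dirty, unpinned and unused. */
bool
inv_needs_cleaning(const InvBuffer *b)
{
    return b->dirty && !b->pinned && b->usage_count == 0;
}

/*
 * Check the claimed invariant over the circular range [sweep_pos, bgw_pos):
 * every buffer there is clean, or pinned, or has usage_count > 0.
 */
bool
inv_region_holds(const InvBuffer *bufs, int nbuffers, int sweep_pos, int bgw_pos)
{
    assert(nbuffers > 0);
    for (int i = sweep_pos; i != bgw_pos; i = (i + 1) % nbuffers)
    {
        if (inv_needs_cleaning(&bufs[i]))
            return false;
    }
    return true;
}
```

A buffer that is dirty but has usage_count > 0 does not violate the check,
which matches the point that dirtying a page always raises its usage_count.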

This means that there is absolutely no point in having the bgwriter
re-start its LRU scan from the clock sweep position each time, as
it currently does. Any pages it revisits are not going to need
cleaning. We might as well have it progress forward from where it
stopped before.

In fact, the notion of the bgwriter's cleaning scan being "in front of"
the clock sweep is entirely backward. It should try to be behind the
sweep, i.e., so far ahead that it has lapped the clock sweep and is trailing
along right behind it, cleaning buffers immediately after their
usage_count falls to zero. All the rest of the buffer arena is either
clean or has positive usage_count.

This means that we don't need the bgwriter_lru_percent parameter at all;
all we need is the lru_maxpages limit on how much I/O to initiate per
wakeup. On each wakeup, the bgwriter always cleans until either it's
dumped lru_maxpages buffers, or it's caught up with the clock sweep.
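
A minimal sketch of that per-wakeup loop follows. It makes several
assumptions: SketchBuffer and sketch_lru_clean are hypothetical names,
clearing the dirty flag stands in for the actual write, and a work pointer
equal to the sweep pointer is taken to mean "caught up". It does not deal
with the bgwriter being lapped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified buffer state (not PostgreSQL's BufferDesc). */
typedef struct
{
    bool dirty;
    bool pinned;
    int  usage_count;
} SketchBuffer;

/*
 * Advance *work_pos from where the previous wakeup stopped, writing dirty,
 * unpinned, usage_count == 0 buffers.  Stop after max_pages writes or when
 * the work pointer reaches sweep_pos.  Returns the number of buffers written.
 */
int
sketch_lru_clean(SketchBuffer *bufs, int nbuffers,
                 int *work_pos, int sweep_pos, int max_pages)
{
    int written = 0;

    assert(nbuffers > 0);
    while (written < max_pages && *work_pos != sweep_pos)
    {
        SketchBuffer *b = &bufs[*work_pos];

        if (b->dirty && !b->pinned && b->usage_count == 0)
        {
            b->dirty = false;   /* stands in for issuing the write */
            written++;
        }
        *work_pos = (*work_pos + 1) % nbuffers;
    }
    return written;
}
```

Note that the work pointer is kept by the caller across wakeups, so no
buffer is revisited until the bgwriter has gone all the way around.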

There is a risk that if the clock sweep manages to lap the bgwriter,
the bgwriter would stop upon "catching up", when in reality there are
dirty pages everywhere. This is easily prevented though, if we add
to the shared BufferStrategyControl struct a counter that is incremented
each time the clock sweep wraps around to buffer zero. (Essentially
this counter stores the high-order bits of the sweep counter.) The
bgwriter can then recognize having been lapped by comparing that counter
to its own similar counter. If it does get lapped, it should advance
its work pointer to the current sweep pointer and try to get ahead
again. (There's no point in continuing to clean pages behind the sweep
when those just ahead of it are dirty.)
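
One way the lap check might look, as a sketch only: SketchPos,
sketch_linear and sketch_buffers_to_scan are hypothetical names, and
combining the pass counter and slot into a single linear position is an
assumption of this example rather than something specified above.
Wraparound of the pass counter itself is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical position: slot index plus number of wraps past buffer zero. */
typedef struct
{
    uint32_t passes;
    int      slot;
} SketchPos;

/* Pass counter as high-order bits, slot as low-order bits. */
int64_t
sketch_linear(SketchPos p, int nbuffers)
{
    return (int64_t) p.passes * nbuffers + p.slot;
}

/*
 * Returns how many buffers the bgwriter may still scan before it has
 * caught up with (is a full lap ahead of) the clock sweep.  If the sweep
 * has lapped the bgwriter, the bgwriter is first moved up to the sweep.
 */
int
sketch_buffers_to_scan(SketchPos *bgw, SketchPos sweep, int nbuffers)
{
    int64_t lead;

    assert(nbuffers > 0);
    lead = sketch_linear(*bgw, nbuffers) - sketch_linear(sweep, nbuffers);
    if (lead < 0)
    {
        *bgw = sweep;           /* lapped: restart at the sweep pointer */
        lead = 0;
    }
    if (lead > nbuffers)
        lead = nbuffers;        /* defensive clamp; should not happen */
    return (int) ((int64_t) nbuffers - lead);
}
```

With equal slots, the pass counters are what distinguish "caught up" (the
bgwriter is one pass ahead, nothing left to scan) from "just restarted"
(same pass, the whole arena left to scan).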

This idea changes the terms of discussion for Itagaki-san's
automatic-adjustment-of-lru_maxpages patch. I'm not sure we'd still
need it at all, as lru_maxpages would now be just an upper bound on the
desired I/O rate, rather than the target itself. If we do still need
such a patch, it probably needs to look a lot different than it does
now.

Comments?

regards, tom lane
