should we set hint bits without dirtying the page?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: should we set hint bits without dirtying the page?
Date: 2010-12-03 00:00:35
Message-ID: AANLkTi=yB_96nxR42NFFrMShNDAVm=kdmteBLVGo+E73@mail.gmail.com
Lists: pgsql-hackers

In a sleepy email late last night on the crash-safe visibility map
thread, I proposed introducing a new buffer state BM_UNTIDY. When a
page is dirtied by a hint bit update, we mark it untidy but not dirty.
Untidy buffers would be treated as dirty by the background writer
cleaning scan, but as clean by checkpoints and by backends doing
emergency buffer cleaning to feed new allocations. This would have
the effect of rate-limiting the number of buffers that we write just
for hint-bit updates. With default settings, we'd write at most
bgwriter_lru_maxpages * (1000 ms/second / bgwriter_delay) untidy pages
per second, which works out to about 4MB/second of write traffic
(100 pages * 5 rounds/second * 8kB). That seems like it might be enough to prevent the
"bulk load followed by SELECT" access pattern from totally swamping
the machine with write traffic, while still ensuring that all the hint
bits eventually do get set.
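
A minimal sketch of the policy and the arithmetic above, not actual
PostgreSQL code. BM_DIRTY is an existing buffer flag and BM_UNTIDY is
the proposed one; the bit values, the WriterKind enum, and both
function names are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; BM_UNTIDY is the proposed state. */
#define BM_DIRTY   (1u << 0)
#define BM_UNTIDY  (1u << 1)

/* Hypothetical labels for the three kinds of buffer writers. */
typedef enum
{
    WRITER_BGWRITER_LRU,    /* background writer cleaning scan */
    WRITER_CHECKPOINT,      /* checkpoint */
    WRITER_BACKEND_EVICT    /* backend freeing a buffer for allocation */
} WriterKind;

/* Dirty buffers are always written; untidy ones only by the cleaning scan. */
bool
buffer_needs_write(uint32_t flags, WriterKind who)
{
    if (flags & BM_DIRTY)
        return true;
    if (flags & BM_UNTIDY)
        return who == WRITER_BGWRITER_LRU;
    return false;
}

/* Upper bound on hint-bit-only write traffic, in bytes per second. */
uint64_t
untidy_write_cap_bytes_per_sec(uint32_t lru_maxpages, uint32_t delay_ms,
                               uint32_t block_size)
{
    assert(delay_ms > 0);
    return ((uint64_t) lru_maxpages * 1000u * block_size) / delay_ms;
}
```

With bgwriter_lru_maxpages = 100, bgwriter_delay = 200 (ms) and 8192-byte
blocks, the cap comes to 4,096,000 bytes per second, which is the
4MB/second figure quoted above.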

I then got to wondering whether we should even go a step further, and
simply decree that a page with only hint bit updates is not dirty and
won't be written, period. If your working set fits in RAM, this isn't
really a big deal because you'll read the pages in once, set the hint
bits, and those pages will just stick around. Where it's a problem is
when you have a huge table that you're scanning over and over again,
especially if data in that table was loaded by many different, widely
spaced XIDs that require looking at many different CLOG pages. But
maybe we could ameliorate that problem by freezing more aggressively.
As soon as all tuples on the page are all-visible, VACUUM would freeze
every tuple on the page (setting a HEAP_XMIN_FROZEN bit rather than
actually overwriting XMIN, to preserve forensic information) and mark
it all-visible in a single WAL-logged operation. Also, we could have
the background writer (!) try to perform this same operation on pages
evicted during the cleaning scan. This would impose the same sort of
I/O cap as the previous idea, although it would generate not only page
writes but also WAL activity.
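
The freeze-by-flag idea can be sketched roughly as follows. This is a
toy model, not the real heap page layout: ToyTuple, ToyPage, the
visible_to_all field, the bit values, and the function name are all
assumptions for illustration. The point it shows is that freezing
sets a bit and leaves xmin intact, and that the page is only touched
once every tuple on it qualifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit values, not PostgreSQL's. */
#define TOY_XMIN_FROZEN       0x0001u
#define TOY_PAGE_ALL_VISIBLE  0x0001u

typedef struct
{
    uint32_t xmin;            /* inserting XID, kept for forensics */
    uint16_t infomask;        /* per-tuple flag bits */
    bool     visible_to_all;  /* stand-in for a real visibility check */
} ToyTuple;

typedef struct
{
    uint16_t  flags;          /* per-page flag bits */
    size_t    ntuples;
    ToyTuple *tuples;
} ToyPage;

/*
 * If every tuple is visible to all transactions, mark each one frozen
 * and mark the page all-visible. Returns true if the page was changed.
 * A real implementation would emit a single WAL record for the change.
 */
bool
toy_freeze_page_if_all_visible(ToyPage *page)
{
    size_t i;

    assert(page != NULL);

    for (i = 0; i < page->ntuples; i++)
        if (!page->tuples[i].visible_to_all)
            return false;     /* leave the page untouched */

    for (i = 0; i < page->ntuples; i++)
        page->tuples[i].infomask |= TOY_XMIN_FROZEN;  /* xmin not overwritten */

    page->flags |= TOY_PAGE_ALL_VISIBLE;
    return true;
}
```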

The result would be not only to reduce the number of times we write
the page (which, right now, can be as many as 3 * number_of_tuples, if
we insert, hint-bit update, and then freeze each tuple separately),
but also to make the freezing happen gradually over time rather than
in a sudden spike when the XID age cut-off is reached. This would
also be advantageous for index-only scans, because a large insert-only
table would gradually accumulate frozen pages without ever being
vacuumed. The gradual freezing wouldn't apply in all cases - in
particular, if you have a large insert-only table that you never
actually read anything out of, you'd still get a spike when the XID
age cut-off is reached. I'm inclined to think it would still be a big
improvement over the status quo - you'd write the table twice instead
of three times, and the second one would often be spread out rather
than all at once.

I foresee various objections. One is that freezing will force FPIs,
so you'll still be writing the data three times. Of course, if you
count FPIs, we're now writing the data four times, but under this
scheme much more data would stick around long enough to get frozen, so
the objection has merit. However, I think we can avoid this too, by
allocating an additional bit in pd_flags, PD_FPI. Instead of emitting
an FPI when the old LSN precedes the redo pointer, we'll emit an FPI
when the FPI bit is set (in which case we'll also clear the bit) OR
when the old LSN precedes the redo pointer. Upon emitting a WAL
record that is torn-page safe (such as a freeze or all-visible
record), we'll pass a flag to XLogInsert that arranges to suppress
FPIs, bump the LSN, and set PD_FPI. That way, if the page is touched
again before the next checkpoint by an operation that does NOT
suppress FPI, one will be emitted then.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
