Re: Page Checksums + Double Writes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-23 16:56:52
Message-ID: CA+Tgmobo8o-_r1Vdc6kWxRSPWCwpjbquB2ww4epi0dNzQPTFwQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 23, 2011 at 11:14 AM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Thoughts?

Those are good thoughts.

Here's another random idea, which might be completely nuts. Maybe we
could consider some kind of summarization of CLOG data, based on the
idea that most transactions commit. We introduce the idea of a CLOG
rollup page. On a CLOG rollup page, each bit represents the status of
N consecutive XIDs. If the bit is set, that means all XIDs in that
group are known to have committed. If it's clear, then we don't know,
and must fall through to a regular CLOG lookup.
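The bit manipulation I have in mind would be roughly the following (a sketch only; `ClogRollupPage`, `rollup_known_committed`, and `rollup_mark_group` are invented names, not actual PostgreSQL symbols):

```c
/* Sketch of the proposed CLOG rollup bitmap. Each bit summarizes
 * XID_GROUP_SIZE consecutive XIDs; a set bit means "all committed",
 * a clear bit means "unknown, do a regular CLOG lookup". */
#include <stdbool.h>
#include <stdint.h>

#define XID_GROUP_SIZE    1024
#define ROLLUP_PAGE_BYTES 8192

typedef struct
{
    uint8_t bits[ROLLUP_PAGE_BYTES];
} ClogRollupPage;

/* Returns true iff the rollup page proves every XID in this
 * XID's group committed. */
static bool
rollup_known_committed(const ClogRollupPage *page, uint32_t xid)
{
    uint32_t group = xid / XID_GROUP_SIZE;

    return (page->bits[group / 8] >> (group % 8)) & 1;
}

/* Set the summary bit once every XID in the group is known
 * to have committed. */
static void
rollup_mark_group(ClogRollupPage *page, uint32_t group)
{
    page->bits[group / 8] |= (uint8_t) (1 << (group % 8));
}
```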

If you let N = 1024, then 8K of CLOG rollup data is enough to
represent the status of 64 million transactions, which means that just
a couple of pages could cover as much of the XID space as you probably
need to care about. Also, you would need to replace CLOG summary
pages in memory only very infrequently. Backends could test the bit
without any lock. If it's set, they do pg_read_barrier(), and then
check the buffer label to make sure it's still the summary page they
were expecting. If so, no CLOG lookup is needed. If the page has
changed under us or the bit is clear, then we fall through to a
regular CLOG lookup.
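Spelling out the arithmetic and the lock-free read protocol in code (again hypothetical names; in PostgreSQL the fence would be pg_read_barrier(), for which a C11 acquire fence stands in here):

```c
/* 8192 bytes * 8 bits/byte = 65536 summary bits; at 1024 XIDs per bit
 * one rollup page covers 67,108,864 (~64 million) transactions. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define XID_GROUP_SIZE       1024
#define ROLLUP_PAGE_BYTES    8192
#define XIDS_PER_ROLLUP_PAGE (ROLLUP_PAGE_BYTES * 8 * XID_GROUP_SIZE)

typedef struct
{
    uint32_t pageno;                    /* which summary page this slot holds */
    uint8_t  bits[ROLLUP_PAGE_BYTES];
} ClogRollupSlot;

/* Returns true only if the summary bit is set AND, after a read
 * barrier, the slot still holds the summary page we expected.
 * Otherwise the caller falls through to a regular CLOG lookup. */
static bool
rollup_check_committed(const ClogRollupSlot *slot, uint32_t xid,
                       uint32_t expected_pageno)
{
    uint32_t group = (xid % XIDS_PER_ROLLUP_PAGE) / XID_GROUP_SIZE;

    if (!((slot->bits[group / 8] >> (group % 8)) & 1))
        return false;                   /* bit clear: status unknown */
    atomic_thread_fence(memory_order_acquire);  /* pg_read_barrier() */
    return slot->pageno == expected_pageno;     /* page replaced under us? */
}
```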

An obvious problem is that, if the abort rate is significantly
different from zero, and especially if the aborts are randomly mixed
in with commits rather than clustered together in small portions of
the XID space, the CLOG rollup data would become useless. On the
other hand, if you're doing 10k tps, you only need to have a window of
a tenth of a second or so where everything commits in order to start
getting some benefit, which doesn't seem like a stretch.

Perhaps the CLOG rollup data wouldn't even need to be kept on disk.
We could simply have bgwriter (or bghinter) set the rollup bits in
shared memory for new transactions, as it becomes possible to do so,
and let lookups for XIDs prior to the last shutdown fall through to
CLOG. Or, if that's not appealing, we could reconstruct the data in
memory by groveling through the CLOG pages - or maybe just set summary
bits only for CLOG pages that actually get faulted in.
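The reconstruction pass when a CLOG page is faulted in might look something like this (the constants mirror CLOG's actual layout of two status bits per transaction, with 0x01 meaning committed; the function name is invented):

```c
/* Scan one group of XID_GROUP_SIZE two-bit status slots on a CLOG
 * page; if every slot is marked committed, the caller can set the
 * corresponding rollup bit in shared memory. */
#include <stdbool.h>
#include <stdint.h>

#define XID_GROUP_SIZE        1024
#define CLOG_BITS_PER_XACT    2
#define CLOG_XACTS_PER_BYTE   4
#define CLOG_STATUS_COMMITTED 0x01

static bool
clog_group_all_committed(const uint8_t *clog_page, uint32_t first_slot)
{
    for (uint32_t i = 0; i < XID_GROUP_SIZE; i++)
    {
        uint32_t slot = first_slot + i;
        uint8_t  status = (clog_page[slot / CLOG_XACTS_PER_BYTE]
                           >> ((slot % CLOG_XACTS_PER_BYTE)
                               * CLOG_BITS_PER_XACT)) & 0x03;

        if (status != CLOG_STATUS_COMMITTED)
            return false;
    }
    return true;
}
```

Setting bits only for faulted-in pages keeps the scan cost proportional to the CLOG pages we were going to read anyway.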

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
