Re: 16-bit page checksums for 9.2

From: Noah Misch <noah(at)leadboat(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-26 20:20:39
Message-ID: 20120126202039.GF15670@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
> v7

This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled. One could not simply condition them on the
page_checksums setting, though. Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on. If that write tears,
leaving the checksum intact, that block will now fail verification. A couple
of ideas come to mind. (a) Read the on-disk page and emit an FPI only if the
old page had a checksum. (b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr. Whenever the cluster starts with checksums enabled and the
field is InvalidXLogRecPtr, set the field to the next LSN. When a checksum
failure occurs in a page with LSN older than the stored one, emit either a
softer warning or no message at all.

Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
the buffer. Here are other sites I noticed that do MarkBufferDirty() without
bumping the page LSN:
heap_xlog_visible()
visibilitymap_clear()
visibilitymap_truncate()
lazy_scan_heap()
XLogRecordPageWithFreeSpace()
FreeSpaceMapTruncateRel()
fsm_set_and_search()
fsm_vacuum_page()
fsm_search_avail()
Though I have not investigated each of these in detail, I suspect most or all
represent continued torn-page hazards. Since the FSM is just a hint, we do
not WAL-log it at all; it would be nicest to always skip checksums for it,
too. The visibility map shortcuts are more nuanced.

When page_checksums=off and we read a page last written by page_checksums=on,
we still verify its checksum. If a mostly-insert/read site uses checksums for
some time and eventually wishes to shed the overhead, disabling the feature
will not stop the costs for reads of old data. They would need a dump/reload
to get the performance of a never-checksummed database. Let's either
unconditionally skip checksum validation under page_checksums=off or add a new
state like page_checksums=ignore for that even-weaker condition.

> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml

> + <para>
> + Turning this parameter off speeds normal operation, but
> + might allow data corruption to go unnoticed. The checksum uses
> + 16-bit checksums, using the fast Fletcher 16 algorithm. With this
> + parameter enabled there is still a non-zero probability that an error
> + could go undetected, as well as a non-zero probability of false
> + positives.
> + </para>

What sources of false positives do you intend to retain?

> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -20,6 +20,7 @@
> #include "postgres.h"
>
> #include "access/visibilitymap.h"
> +#include "access/transam.h"
> #include "access/xact.h"
> #include "access/xlogutils.h"
> #include "catalog/catalog.h"
> @@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
> /* XLOG gives us high 4 bits */
> #define XLOG_SMGR_CREATE 0x10
> #define XLOG_SMGR_TRUNCATE 0x20
> +#define XLOG_SMGR_HINT 0x40
>
> typedef struct xl_smgr_create
> {
> @@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
> smgrDoPendingDeletes(false);
> }
>
> +/*
> + * Write a backup block if needed when we are setting a hint.
> + *
> + * Deciding the "if needed" bit is delicate and requires us to either
> + * grab WALInsertLock or check the info_lck spinlock. If we check the
> + * spinlock and it says Yes then we will need to get WALInsertLock as well,
> + * so the design choice here is to just go straight for the WALInsertLock
> + * and trust that calls to this function are minimised elsewhere.
> + *
> + * Callable while holding share lock on the buffer content.
> + *
> + * Possible that multiple concurrent backends could attempt to write
> + * WAL records. In that case, more than one backup block may be recorded
> + * though that isn't important to the outcome and the backup blocks are
> + * likely to be identical anyway.
> + */
> +#define SMGR_HINT_WATERMARK 13579
> +void
> +smgr_buffer_hint(Buffer buffer)
> +{
> + /*
> + * Make an XLOG entry reporting the hint
> + */
> + XLogRecPtr lsn;
> + XLogRecData rdata[2];
> + int watermark = SMGR_HINT_WATERMARK;
> +
> + /*
> + * Not allowed to have zero-length records, so use a small watermark
> + */
> + rdata[0].data = (char *) (&watermark);
> + rdata[0].len = sizeof(int);
> + rdata[0].buffer = InvalidBuffer;
> + rdata[0].buffer_std = false;
> + rdata[0].next = &(rdata[1]);
> +
> + rdata[1].data = NULL;
> + rdata[1].len = 0;
> + rdata[1].buffer = buffer;
> + rdata[1].buffer_std = true;
> + rdata[1].next = NULL;
> +
> + lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata);
> +
> + /*
> + * Set the page LSN if we wrote a backup block.
> + */
> + if (!XLByteEQ(InvalidXLogRecPtr, lsn))
> + {
> + Page page = BufferGetPage(buffer);
> + PageSetLSN(page, lsn);

PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
is insufficient.

Need a PageSetTLI() here, too.

> + elog(LOG, "inserted backup block for hint bit");
> + }
> +}

> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c

> @@ -2341,6 +2350,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
> if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
> (BM_DIRTY | BM_JUST_DIRTIED))
> {
> + /*
> + * If we're writing checksums and we care about torn pages then we
> + * cannot dirty a page during recovery as a result of a hint.
> + * We can set the hint, just not dirty the page as a result.
> + *
> + * See long discussion in bufpage.c
> + */
> + if (HintsMustNotDirtyPage())
> + return;

This expands to
if (page_checksums && fullPageWrites && RecoveryInProgress())
If only page_checksums is false, smgr_buffer_hint() can attempt to insert a
WAL record during recovery.

> +
> + /*
> + * Write a full page into WAL iff this is the first change on the
> + * block since the last checkpoint. That will never be the case
> + * if the block is already dirty because we either made a change
> + * or set a hint already. Note that aggressive cleaning of blocks
> + * dirtied by hint bit setting would increase the call rate.
> + * Bulk setting of hint bits would reduce the call rate...
> + *
> + * We must issue the WAL record before we mark the buffer dirty.
> + * Otherwise we might write the page before we write the WAL.
> + * That causes a race condition, since a checkpoint might
> + * occur between writing the WAL record and marking the buffer dirty.
> + * We solve that with a kluge, but one that is already in use
> + * during transaction commit to prevent race conditions.
> + * Basically, we simply prevent the checkpoint WAL record from
> + * being written until we have marked the buffer dirty. We don't
> + * start the checkpoint flush until we have marked dirty, so our
> + * checkpoint must flush the change to disk successfully or the
> + * checkpoint never gets written, so crash recovery will set us right.
> + *
> + * XXX rename PGPROC variable later; keep it same now for clarity
> + */
> + MyPgXact->inCommit = true;
> + smgr_buffer_hint(buffer);
> +
> LockBufHdr(bufHdr);
> Assert(bufHdr->refcount > 0);
> if (!(bufHdr->flags & BM_DIRTY))
> @@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
> }
> bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
> UnlockBufHdr(bufHdr);
> + MyPgXact->inCommit = false;
> }
> }
>
> diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
> index 096d36a..a220310 100644
> --- a/src/backend/storage/buffer/localbuf.c
> +++ b/src/backend/storage/buffer/localbuf.c
> @@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> /* Find smgr relation for buffer */
> oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
>
> + /* XXX do we want to write checksums for local buffers? An option? */
> +

Don't we have that, now that it happens in mdwrite()?

I can see value in an option to exclude local buffers, since corruption there
may be less exciting. It doesn't seem important for an initial patch, though.

> --- a/src/backend/storage/page/bufpage.c
> +++ b/src/backend/storage/page/bufpage.c

> + * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.

I get a 404 for that link.

> --- a/src/include/storage/bufpage.h
> +++ b/src/include/storage/bufpage.h

> -#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
> +#define PD_VALID_FLAG_BITS 0x800F /* OR of all non-checksum pd_flags bits */

The comment is not consistent with the character of the value.

Thanks,
nm

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jaime Casanova 2012-01-26 21:01:58 Re: pg_stats_recovery view
Previous Message Tom Lane 2012-01-26 20:06:38 Re: WIP patch for parameterized inner paths