Re: 16-bit page checksums for 9.2

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-19 16:35:48
Message-ID: CA+U5nMKPGVxg7n8UrgjO0ja8LFqUkNoV2oW5RPAZqzz4e-NMhg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 16, 2012 at 1:53 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> +                       /*
> +                        * If we're in recovery we cannot dirty a page because of a hint.
> +                        * We can set the hint, just not dirty the page as a result so
> +                        * the hint is lost when we evict the page or shutdown.
> +                        *
> +                        * See long discussion in bufpage.c
> +                        */
> +                       if (RecoveryInProgress())
> +                               return;
>
> Doesn't this seem awfully bad for performance on Hot Standby servers?
> I agree that it fixes the problem with un-WAL-logged pages there, but
> I seem to recall some recent complaining about performance features
> that work on the master but not the standby.  Durable hint bits are
> one such feature.

It's impossible for it to work, in this case, since we cannot write
new WAL to prevent torn pages.

Note that hint bit setting on a dirty block is allowed, so many hints
will still be set in Hot Standby.

> +                        * Basically, we simply prevent the checkpoint WAL record from
> +                        * being written until we have marked the buffer dirty. We don't
> +                        * start the checkpoint flush until we have marked dirty, so our
> +                        * checkpoint must flush the change to disk successfully or the
> +                        * checkpoint never gets written, so crash recovery will fix.
> +                        *
> +                        * It's possible we may enter here without an xid, so it is
> +                        * essential that CreateCheckpoint waits for virtual transactions
> +                        * rather than full transactionids.
> +                        */
> +                       MyPgXact->delayChkpt = delayChkpt = true;
>
> I am slightly worried that this expansion in the use of this mechanism
> (formerly called inCommit, for those following along at home) could
> lead to checkpoint starvation.  Suppose we've got one or two large
> table scans wandering along, setting hint bits, and now suddenly it's
> time to checkpoint.  How long will it take the checkpoint process to
> find a time when nobody's got delayChkpt set?

We don't need to wait until nobody has it set, we just need to wait
for the people that had it set when we first checked to be out of that
state momentarily.

> + #define PageSetChecksum(page) \
> + do \
> + { \
> +       PageHeader      p = (PageHeader) page; \
> +       p->pd_flags |= PD_PAGE_VERSION_PLUS1; \
> +       p->pd_flags |= PD_CHECKSUM1; \
> +       p->pd_flags &= ~PD_CHECKSUM2; \
> +       p->pd_verify.pd_checksum16 = PageCalcChecksum16(page); \
> + } while (0);
> +
> + /* ensure any older checksum info is overwritten with watermark */
> + #define PageResetVersion(page) \
> + do \
> + { \
> +       PageHeader      p = (PageHeader) page; \
> +       if (!PageHasNoChecksum(p)) \
> +       { \
> +               p->pd_flags &= ~PD_PAGE_VERSION_PLUS1; \
> +               p->pd_flags &= ~PD_CHECKSUM1; \
> +               p->pd_flags &= ~PD_CHECKSUM2; \
> +               PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION); \
> +       } \
> + } while (0);
>
> So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
> doesn't have a checksum, PD_CHECKSUM2 is not set?  What good does that
> do?

As explained in detailed comments, the purpose of this is to implement
Heikki's suggestion that we have a bit set to zero so we can detect
failures that cause a run of 1s.

>   * PageGetPageSize
>   *            Returns the page size of a page.
>   *
> !  * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
> !  * This can be called on any page, initialised or not, in or out of buffers.
> !  * You might think this can vary at runtime but you'd be wrong, since pages
> !  * frequently need to occupy buffers and pages are copied from one to another
> !  * so there are many hidden assumptions that this simple definition is true.
>   */
> ! #define PageGetPageSize(page) (BLCKSZ)
>
> I think I agree that the current definition of PageGetPageSize seems
> unlikely to come anywhere close to allowing us to cope with multiple
> page sizes, but I think this method of disabling it is a hack.  The
> callers that want to know how big the page really is should just use
> BLCKSZ instead of this macro, and those that want to know how big the
> page THINKS it is (notably contrib/pageinspect) need a way to get that
> information.

Fair comment.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2012-02-19 16:50:37 pgsql: Improve pretty printing of viewdefs.
Previous Message Larry 2012-02-19 15:40:35 Re: Proposal: XML helper functions