Re: WAL consistency check facility

From: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WAL consistency check facility
Date: 2017-02-08 06:25:10
Message-ID: CAGz5QCKa_n_qV-p_zf7b9HS0gjei=e18NiKurfrg-v9NRBinpg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Tue, Jan 31, 2017 at 9:36 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Jan 5, 2017 at 12:54 AM, Kuntal Ghosh
> <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
>> I've attached the patch with the modified changes. PFA.
>
> Can this patch check contrib/bloom?
>
Added support for generic resource manager type.

> + /*
> + * Mask some line pointer bits, particularly those marked as
> + * used on a master and unused on a standby.
> + */
>
> Comment doesn't explain why we need to do this.
>
Added comments.

> + /*
> + * For GIN_DELETED page, the page is initialized to empty.
> + * Hence mask everything.
> + */
> + if (opaque->flags & GIN_DELETED)
> + memset(page_norm, MASK_MARKER, BLCKSZ);
> + else
> + mask_unused_space(page_norm);
>
> If the page is initialized to empty, why do we need to mask
> anything/everything? Anyway, it doesn't seem right to mask the
> GinPageOpaque structure itself. Or at least it doesn't seem right to
> mask the flags word.
>
Modified accordingly.

>
> + if (!HeapTupleHeaderXminFrozen(page_htup))
> + page_htup->t_infomask |= HEAP_XACT_MASK;
> + else
> + page_htup->t_infomask |= HEAP_XMAX_COMMITTED |
> HEAP_XMAX_INVALID;
>
> Comment doesn't address this logic. Also, the "else" case shouldn't
> exist at all, I think.
Added comments. I think that "Else" part is needed for xmax.

>
> + /*
> + * For a speculative tuple, the content of t_ctid is conflicting
> + * between the backup page and current page. Hence, we set it
> + * to the current block number and current offset.
> + */
>
> Why does it differ? Is that a bug?
>
Added comments.

>
> + /*
> + * During replay of a btree page split, we don't set the BTP_SPLIT_END
> + * flag of the right sibling and initialize th cycle_id to 0 for the same
> + * page.
> + */
>
> A reference to the name of the redo function might be helpful here
> (and in some other places), unless it's just ${AMNAME}_redo. "th" is
> a typo for "the".
Corrected.

> + appendStringInfoString(buf, " (full page
> image, apply)");
> + else
> + appendStringInfoString(buf, " (full page image)");
>
> How about "(full page image)" and "(full page image, for WAL verification)"?
>
> Similarly in XLogDumpDisplayRecord, I think we should assume that
> "FPW" means what it has always meant, and leave that output alone.
> Instead, distinguish the WAL-consistency-checking case when it
> happens, e.g. "(FPW for consistency check)".
>
Corrected.

> +checkConsistency(XLogReaderState *record)
>
> How about CheckXLogConsistency()?
>
> + * If needs_backup is true or wal consistency check is enabled for
>
> ...or WAL checking is enabled...
>
> + * If WAL consistency is enabled for the resource manager of
>
> If WAL consistency checking is enabled...
>
> + * Note: when a backup block is available in XLOG with BKPIMAGE_APPLY flag
>
> with the BKPIMAGE_APPLY flag
Modified accordingly.

>
> - * In RBM_ZERO_* modes, if the page doesn't exist, the relation is extended
> - * with all-zeroes pages up to the referenced block number. In
> - * RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
> + * In RBM_ZERO_* modes, if the page doesn't exist or BKPIMAGE_APPLY flag
> + * is not set for the backup block, the relation is extended with all-zeroes
> + * pages up to the referenced block number.
>
> OK, I'm puzzled by this. Surely we don't want the WAL consistency
> checking facility to cause the relation to be extended like this. And
> I don't see why it would, because the WAL consistency checking happens
> after the page changes have already been applied. I wonder if this
> hunk is in error and should be dropped.
>
Dropped comments.

> + PageXLogRecPtrSet(phdr->pd_lsn, PG_UINT64_MAX);
> + phdr->pd_prune_xid = PG_UINT32_MAX;
> + phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
> + phdr->pd_flags |= PD_ALL_VISIBLE;
> +#define MASK_MARKER 0xFF
> (and many others)
>
> Why do we mask by setting bits rather than clearing bits? My
> intuition would have been to zero things we want to mask, rather than
> setting them to one.
>
Modified accordingly so that we can zero things during masking instead
of setting them to one.

> + {
> + newwalconsistency[i] = true;
> + }
>
> Superfluous braces.
>
Corrected.

> + /*
> + * Leave if no masking functions defined, this is possible in the case
> + * resource managers generating just full page writes, comparing an
> + * image to itself has no meaning in those cases.
> + */
> + if (RmgrTable[rmid].rm_mask == NULL)
> + return;
>
> ...and also...
>
> + /*
> + * This feature is enabled only for the resource managers where
> + * a masking function is defined.
> + */
> + for (i = 0; i <= RM_MAX_ID; i++)
> + {
> + if (RmgrTable[i].rm_mask != NULL)
>
> Why do we assume that the feature is only enabled for RMs that have a
> mask function? Why not instead assume that if there's no masking
> function, no masking is required?
I've introduced a function in rmgr.c, named
consistencyCheck_is_enabled, which returns true if
wal_consistency_checking is enabled for a resource manager. It does
not have any dependency on the masking function. If a masking function
is defined, then we mask the page before consistency check. However,
I'm not sure whether rmgr.c is the right place to define the function
consistencyCheck_is_enabled.

>
> + /* Definitely not an individual resource manager. Check for 'all'. */
> + if (pg_strcasecmp(tok, "all") == 0)
>
> It seems like it might be cleaner to check for "all" first, and then
> check for individual RMs afterward.
>
Done.

> + /*
> + * Parameter should contain either 'all' or a combination of resource
> + * managers.
> + */
> + if (isAll && isRmgrId)
> + {
> + GUC_check_errdetail("Invalid value combination");
> + return false;
> + }
>
> That error message is not very clear, and I don't see why we even need
> to check this. If someone sets wal_consistency_checking = hash, all,
> let's just set it for all and the fact that hash is also set won't
> matter to anything.
>
Modified accordingly.

> + void (*rm_mask) (char *page, BlockNumber blkno);
>
> Could the page be passed as type "Page" rather than a "char *" to make
> things more convenient for the masking functions? If not, could those
> functions at least do something like "Page page = (Page) pagebytes;"
> rather than "Page page_norm = (Page) page;"?
>
Modified it as "Page tempPage = (Page) page;"

> + /*
> + * Read the contents from the current buffer and store it in a
> + * temporary page.
> + */
> + buf = XLogReadBufferExtended(rnode, forknum, blkno,
> + RBM_NORMAL);
> + if (!BufferIsValid(buf))
> + continue;
> +
> + new_page = BufferGetPage(buf);
> +
> + /*
> + * Read the contents from the backup copy, stored in WAL record
> + * and store it in a temporary page. There is not need to allocate
> + * a new page here, a local buffer is fine to hold its contents and
> + * a mask can be directly applied on it.
> + */
> + if (!RestoreBlockImage(record, block_id, old_page_masked))
> + elog(ERROR, "failed to restore block image");
> +
> + /*
> + * Take a copy of the new page where WAL has been applied to have
> + * a comparison base before masking it...
> + */
> + memcpy(new_page_masked, new_page, BLCKSZ);
> +
> + /* No need for this page anymore now that a copy is in */
> + ReleaseBuffer(buf);
>
> The order of operations is strange here. We read the "new" page,
> holding the pin (but no lock?). Then we restore the block image into
> old_page_masked. Now we copy the new page and release the pin. It
> would be better, ISTM, to rearrange that so that we finish with the
> new page and release the pin before dealing with the old page. Also,
> I think we need to actually lock the buffer before copying it. Maybe
> that's not strictly necessary since this is all happening on the
> standby, but it seems like a bad idea to set the precedent that you
> can read a page without taking the content lock.
>
Modified accordingly.

> I think the "new" and "old" page terminology is kinda weird too.
> Maybe we should call them the "replay image" and the "master image" or
> something like that. A few more comments wouldn't hurt either.
>
Done.

> + * Also mask the all-visible flag.
> + *
> + * XXX: It is unfortunate that we have to do this. If the flag is set
> + * incorrectly, that's serious, and we would like to catch it. If the flag
> + * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
> + * records don't currently set the flag, even though it is set in the
> + * master, so we must silence failures that that causes.
> + */
> + phdr->pd_flags |= PD_ALL_VISIBLE;
>
> I'm puzzled by the reference to HEAP_CLEAN. The thing that might set
> the all-visible bit is XLOG_HEAP2_VISIBLE, not XLOG_HEAP2_CLEAN.
> Unless I'm missing something, there's no situation in which
> XLOG_HEAP2_CLEAN might be associated with setting PD_ALL_VISIBLE.
> Also, XLOG_HEAP2_VISIBLE records do SOMETIMES set the bit, just not
> always. And there's a good reason for that, which is explained in
> this comment:
>
> * We don't bump the LSN of the heap page when setting the visibility
> * map bit (unless checksums or wal_hint_bits is enabled, in which
> * case we must), because that would generate an unworkable volume of
> * full-page writes. This exposes us to torn page hazards, but since
> * we're not inspecting the existing page contents in any way, we
> * don't care.
> *
> * However, all operations that clear the visibility map bit *do* bump
> * the LSN, and those operations will only be replayed if the XLOG LSN
> * follows the page LSN. Thus, if the page LSN has advanced past our
> * XLOG record's LSN, we mustn't mark the page all-visible, because
> * the subsequent update won't be replayed to clear the flag.
>
> So I think this comment needs to be rewritten with a bit more nuance.
>
Corrected.

> +extern void mask_unused_space(Page page);
> +#endif
>
> Missing newline.
>
Done.

Thank you Robert for the review. Please find the updated patch in the
attachment.

Thanks to Amit Kapila and Dilip Kumar for their suggestions in offline
discussions.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
walconsistency_v17.patch application/x-download 53.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2017-02-08 06:40:55 Re: DROP SUBSCRIPTION and ROLLBACK
Previous Message Tom Lane 2017-02-08 06:05:08 Re: [PATCH] configure-time knob to set default ssl ciphers