Re: WAL consistency check facility

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: WAL consistency check facility
Date: 2017-01-31 16:06:15
Message-ID: CA+TgmoZ57dyWF7isC+u5TXaP2bL7bc8zrW5F+9gT1tcDMiWt3w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 5, 2017 at 12:54 AM, Kuntal Ghosh
<kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
> I've attached the patch with the modified changes. PFA.

Can this patch check contrib/bloom?

+ /*
+ * Mask some line pointer bits, particularly those marked as
+ * used on a master and unused on a standby.
+ */

Comment doesn't explain why we need to do this.

+ /*
+ * For GIN_DELETED page, the page is initialized to empty.
+ * Hence mask everything.
+ */
+ if (opaque->flags & GIN_DELETED)
+ memset(page_norm, MASK_MARKER, BLCKSZ);
+ else
+ mask_unused_space(page_norm);

If the page is initialized to empty, why do we need to mask
anything/everything? Anyway, it doesn't seem right to mask the
GinPageOpaque structure itself. Or at least it doesn't seem right to
mask the flags word.

+ /*
+ * For gist leaf pages, mask some line pointer bits, particularly
+ * those marked as used on a master and unused on a standby.
+ */

Comment doesn't explain why we need to do this.

+ if (!HeapTupleHeaderXminFrozen(page_htup))
+ page_htup->t_infomask |= HEAP_XACT_MASK;
+ else
+ page_htup->t_infomask |= HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID;

Comment doesn't address this logic. Also, the "else" case shouldn't
exist at all, I think.

+ /*
+ * For a speculative tuple, the content of t_ctid is conflicting
+ * between the backup page and current page. Hence, we set it
+ * to the current block number and current offset.
+ */

Why does it differ? Is that a bug?

+ /*
+ * Mask everything on a DELETED page since it will be re-initialized
+ * during replay.
+ */
+ if ((maskopaq->btpo_flags & BTP_DELETED) != 0)
+ {
+ /* Mask Page Content */
+ memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+ BLCKSZ - SizeOfPageHeaderData);
+
+ /* Mask pd_lower and pd_upper */
+ memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER,
+ sizeof(uint16));
+ memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER,
+ sizeof(uint16));

This isn't consistent with the GIN_DELETE case - it is more selective
about what it masks. Probably that logic should be adapted to look
more like this.

+ /*
+ * Mask some line pointer bits, particularly those marked as
+ * used on a master and unused on a standby.
+ */

Comment (still) doesn't explain why we need to do this.

+ /*
+ * During replay of a btree page split, we don't set the BTP_SPLIT_END
+ * flag of the right sibling and initialize th cycle_id to 0 for the same
+ * page.
+ */

A reference to the name of the redo function might be helpful here
(and in some other places), unless it's just ${AMNAME}_redo. "th" is
a typo for "the".

+ appendStringInfoString(buf, " (full page
image, apply)");
+ else
+ appendStringInfoString(buf, " (full page image)");

How about "(full page image)" and "(full page image, for WAL verification)"?

Similarly in XLogDumpDisplayRecord, I think we should assume that
"FPW" means what it has always meant, and leave that output alone.
Instead, distinguish the WAL-consistency-checking case when it
happens, e.g. "(FPW for consistency check)".

+checkConsistency(XLogReaderState *record)

How about CheckXLogConsistency()?

+ * If needs_backup is true or wal consistency check is enabled for

...or WAL checking is enabled...

+ * If WAL consistency is enabled for the resource manager of

If WAL consistency checking is enabled...

+ * Note: when a backup block is available in XLOG with BKPIMAGE_APPLY flag

with the BKPIMAGE_APPLY flag

- * In RBM_ZERO_* modes, if the page doesn't exist, the relation is extended
- * with all-zeroes pages up to the referenced block number. In
- * RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
+ * In RBM_ZERO_* modes, if the page doesn't exist or BKPIMAGE_APPLY flag
+ * is not set for the backup block, the relation is extended with all-zeroes
+ * pages up to the referenced block number.

OK, I'm puzzled by this. Surely we don't want the WAL consistency
checking facility to cause the relation to be extended like this. And
I don't see why it would, because the WAL consistency checking happens
after the page changes have already been applied. I wonder if this
hunk is in error and should be dropped.

+ PageXLogRecPtrSet(phdr->pd_lsn, PG_UINT64_MAX);
+ phdr->pd_prune_xid = PG_UINT32_MAX;
+ phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+ phdr->pd_flags |= PD_ALL_VISIBLE;
+#define MASK_MARKER 0xFF
(and many others)

Why do we mask by setting bits rather than clearing bits? My
intuition would have been to zero things we want to mask, rather than
setting them to one.

+ {
+ newwalconsistency[i] = true;
+ }

Superfluous braces.

+ /*
+ * Leave if no masking functions defined, this is possible in the case
+ * resource managers generating just full page writes, comparing an
+ * image to itself has no meaning in those cases.
+ */
+ if (RmgrTable[rmid].rm_mask == NULL)
+ return;

...and also...

+ /*
+ * This feature is enabled only for the resource managers where
+ * a masking function is defined.
+ */
+ for (i = 0; i <= RM_MAX_ID; i++)
+ {
+ if (RmgrTable[i].rm_mask != NULL)

Why do we assume that the feature is only enabled for RMs that have a
mask function? Why not instead assume that if there's no masking
function, no masking is required?

+ /* Definitely not an individual resource manager. Check for 'all'. */
+ if (pg_strcasecmp(tok, "all") == 0)

It seems like it might be cleaner to check for "all" first, and then
check for individual RMs afterward.

+ /*
+ * Parameter should contain either 'all' or a combination of resource
+ * managers.
+ */
+ if (isAll && isRmgrId)
+ {
+ GUC_check_errdetail("Invalid value combination");
+ return false;
+ }

That error message is not very clear, and I don't see why we even need
to check this. If someone sets wal_consistency_checking = hash, all,
let's just set it for all and the fact that hash is also set won't
matter to anything.

+ void (*rm_mask) (char *page, BlockNumber blkno);

Could the page be passed as type "Page" rather than a "char *" to make
things more convenient for the masking functions? If not, could those
functions at least do something like "Page page = (Page) pagebytes;"
rather than "Page page_norm = (Page) page;"?

+ /*
+ * Read the contents from the current buffer and store it in a
+ * temporary page.
+ */
+ buf = XLogReadBufferExtended(rnode, forknum, blkno,
+ RBM_NORMAL);
+ if (!BufferIsValid(buf))
+ continue;
+
+ new_page = BufferGetPage(buf);
+
+ /*
+ * Read the contents from the backup copy, stored in WAL record
+ * and store it in a temporary page. There is not need to allocate
+ * a new page here, a local buffer is fine to hold its contents and
+ * a mask can be directly applied on it.
+ */
+ if (!RestoreBlockImage(record, block_id, old_page_masked))
+ elog(ERROR, "failed to restore block image");
+
+ /*
+ * Take a copy of the new page where WAL has been applied to have
+ * a comparison base before masking it...
+ */
+ memcpy(new_page_masked, new_page, BLCKSZ);
+
+ /* No need for this page anymore now that a copy is in */
+ ReleaseBuffer(buf);

The order of operations is strange here. We read the "new" page,
holding the pin (but no lock?). Then we restore the block image into
old_page_masked. Now we copy the new page and release the pin. It
would be better, ISTM, to rearrange that so that we finish with the
new page and release the pin before dealing with the old page. Also,
I think we need to actually lock the buffer before copying it. Maybe
that's not strictly necessary since this is all happening on the
standby, but it seems like a bad idea to set the precedent that you
can read a page without taking the content lock.

I think the "new" and "old" page terminology is kinda weird too.
Maybe we should call them the "replay image" and the "master image" or
something like that. A few more comments wouldn't hurt either.

+ * Also mask the all-visible flag.
+ *
+ * XXX: It is unfortunate that we have to do this. If the flag is set
+ * incorrectly, that's serious, and we would like to catch it. If the flag
+ * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+ * records don't currently set the flag, even though it is set in the
+ * master, so we must silence failures that that causes.
+ */
+ phdr->pd_flags |= PD_ALL_VISIBLE;

I'm puzzled by the reference to HEAP_CLEAN. The thing that might set
the all-visible bit is XLOG_HEAP2_VISIBLE, not XLOG_HEAP2_CLEAN.
Unless I'm missing something, there's no situation in which
XLOG_HEAP2_CLEAN might be associated with setting PD_ALL_VISIBLE.
Also, XLOG_HEAP2_VISIBLE records do SOMETIMES set the bit, just not
always. And there's a good reason for that, which is explained in
this comment:

* We don't bump the LSN of the heap page when setting the visibility
* map bit (unless checksums or wal_hint_bits is enabled, in which
* case we must), because that would generate an unworkable volume of
* full-page writes. This exposes us to torn page hazards, but since
* we're not inspecting the existing page contents in any way, we
* don't care.
*
* However, all operations that clear the visibility map bit *do* bump
* the LSN, and those operations will only be replayed if the XLOG LSN
* follows the page LSN. Thus, if the page LSN has advanced past our
* XLOG record's LSN, we mustn't mark the page all-visible, because
* the subsequent update won't be replayed to clear the flag.

So I think this comment needs to be rewritten with a bit more nuance.

+extern void mask_unused_space(Page page);
+#endif

Missing newline.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2017-01-31 16:10:24 Re: multivariate statistics (v19)
Previous Message Pavel Stehule 2017-01-31 15:32:57 Re: patch: function xmltable