Re: Online checksums verification in the backend

From: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: Online checksums verification in the backend
Date: 2020-04-03 03:24:50
Message-ID: CA+fd4k4--o41NxwN2kjfRn6W474R-P9D8aL8FW27nQmWm0++hA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, 28 Mar 2020 at 21:19, Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
>
> On Sat, Mar 28, 2020 at 12:28:27PM +0900, Masahiko Sawada wrote:
> > On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
> > >
> > > v5 attached
> >
> > Thank you for updating the patch. I have some comments:
>
> Thanks a lot for the review!

Thank you for updating the patch!

> > 4.
> > + /* Check if the relation (still) exists */
> > + if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
> > + {
> > + /*
> > + * We use a standard relation_open() to acquire the initial lock. It
> > + * means that this will block until the lock is acquired, or will
> > + * raise an ERROR if lock_timeout has been set. If caller wants to
> > + * check multiple tables while relying on a maximum wait time, it
> > + * should process tables one by one instead of relying on a global
> > + * processing with the main SRF.
> > + */
> > + relation = relation_open(relid, AccessShareLock);
> > + }
> >
> > IIUC the above was necessary because we used to have
> > check_all_relations() which iterates all relations on the database to
> > do checksum checks. But now that we don't have that function and
> > pg_check_relation processes one relation. Can we just do
> > relation_open() here?
>
> Ah yes I missed that comment. I think only the comment needed to be updated to
> remove any part related to NULL-relation call. I ended up removign the whole
> comment since locking and lock_timeout behavior is inherent to relation_open
> and there's no need to document that any further now that we always only check
> one relation at a time.

The current patch still checks SearchSysCacheExists1() before
relation_open. Why do we need to call SearchSysCacheExists1() here? I
think if the given relation doesn't exist, relation_open() will raise
an error "could not open relation with OID %u".

+ /* Open the relation if it exists. */
+ if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+ {
+ relation = relation_open(relid, AccessShareLock);
+ }

> > 8.
> > + if (PageIsVerified(buffer, blkno))
> > + {
> > + /*
> > + * If the page is really new, there won't by any checksum to be
> > + * computed or expected.
> > + */
> > + *chk_expected = *chk_found = NoComputedChecksum;
> > + return true;
> > + }
> > + else
> > + {
> > + /*
> > + * There's a corruption, but since this affect PageIsNew, we
> > + * can't compute a checksum, so set NoComputedChecksum for the
> > + * expected checksum.
> > + */
> > + *chk_expected = NoComputedChecksum;
> > + *chk_found = hdr->pd_checksum;
> > + }
> > + return false;
> >
> > * I think the 'else' is not necessary here.
>
> AFAICT it's, see below.
>
> > * Setting *chk_expected and *chk_found seems useless when we return
> > true. The caller doesn't use them.
>
> Indeed, fixed.

The patch still sets values to both?

+ if (PageIsVerified(buffer, blkno))
+ {
+ /* No corruption. */
+ *chk_expected = *chk_found = NoComputedChecksum;
+ return true;
+ }

>
> > * Should we forcibly overwrite ignore_checksum_failure to off in
> > pg_check_relation()? Otherwise, this logic seems not work fine.
> >
> > * I don't understand why we cannot compute a checksum in case where a
> > page looks like a new page but is actually corrupted. Could you please
> > elaborate on that?
>
> PageIsVerified has a different behavior depending on whether the page looks new
> or not. If the page looks like new, it only checks that it's indeed a new
> page, and otherwise try to verify the checksum.
>
> Also, pg_check_page() has an assert to make sure that the page isn't (or don't
> look like) new.
>
> So it seems to me that the 'else' is required to properly detect a real or fake
> PageIsNew, and try to compute checksums only when required.

Thank you for your explanation! I understand.

I thought we can arrange the code to something like:

if (PageIsNew(hdr))
{
if (PageIsVerified(hdr))
{
*chk_expected = *chk_found = NoComputedChecksum;
return true;
}

*chk_expected = NoComputedChecksum;
*chk_found = hdr->pd_checksum;
return false;
}

But since it's not a critical problem you can ignore it.

>
> > 8.
> > + {
> > + {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
> > + gettext_noop("Checksum cost for a page found in the buffer cache."),
> > + NULL
> > + },
> > + &ChecksumCostPageHit,
> > + 1, 0, 10000,
> > + NULL, NULL, NULL
> > + },
> >
> > * There is no description about the newly added four GUC parameters in the doc.
> >
> > * We need to put new GUC parameters into postgresql.conf.sample as well.
>
> Fixed both.
>
> > * The patch doesn't use checksum_cost_page_hit at all.
>
> Indeed, I also realized that while working on previous issues. I removed it
> and renamed checksum_cost_page_miss to checksum_cost_page.

Perhaps we can use checksum_cost_page_hit when we found the page in
the shared buffer but it's marked as dirty?

> >
> > 9.
> > diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
> > index eb19644419..37f63e747c 100644
> > --- a/src/backend/utils/init/globals.c
> > +++ b/src/backend/utils/init/globals.c
> > @@ -134,6 +134,14 @@ int max_worker_processes = 8;
> > int max_parallel_workers = 8;
> > int MaxBackends = 0;
> >
> > +int ChecksumCostPageHit = 1; /* GUC parameters for
> > checksum check */
> > +int ChecksumCostPageMiss = 10;
> > +int ChecksumCostLimit = 200;
> > +double ChecksumCostDelay = 0;
> > +
> > +int ChecksumCostBalance = 0; /* working state for
> > checksums check */
> > +bool ChecksumCostActive = false;
> >
> > Can we declare them in checksum.c since these parameters are used only
> > in checksum.c and it does I/O my itself.
>
> The GUC parameters would still need to be global, so for consistency I kept all
> the variables in globals.c.

Okay.

> >
> > 10.
> > + /* Report the failure to the stat collector and the logs. */
> > + pgstat_report_checksum_failure();
> > + ereport(WARNING,
> > + (errcode(ERRCODE_DATA_CORRUPTED),
> > + errmsg("invalid page in block %u of relation %s",
> > + blkno,
> > + relpath(relation->rd_smgr->smgr_rnode, forknum))));
> >
> > I think we could do pgstat_report_checksum_failure() and emit WARNING
> > twice for the same page since PageIsVerified() also does the same.
>
> As mentioned before, in this patch I only calls PageIsVerified() if the buffer
> looks like new, and in this case PageIsVerified() only verify that it's a true
> all-zero-page, and won't try to verify the checksum, so there's no possibility
> of duplicated report. I modified the comments to document all the interactions
> and expectations.

You're right. Thank you for the explanation!

I've read the latest patch and here is random comments:

1.
+ /*
+ * Add a page miss cost, as we're always reading outside the shared
+ * buffers.
+ */
+ /* Add a page cost. */
+ ChecksumCostBalance += ChecksumCostPage;

There are duplicate comments.

2.
+ /* Dirty pages are ignored as they'll be flushed soon. */
+ if (buf_state & BM_DIRTY)
+ checkit = false;

Should we check the buffer if it has BM_TAG_VALID as well here? I
thought there might be a possibility that BufTableLookup() returns a
buf_Id but its buffer tag is not valid for example when the previous
read failed after inserting the buffer tag to the buffer table.

3.
+ /* Add a page cost. */
+ ChecksumCostBalance += ChecksumCostPage;
+
+ return checkit;
+}

The check_get_buffer() seems to be slightly complex to me but when we
reached the end of this function we always return true. Similarly, in
the case where we read the block while holding a partition lock we
always return true as well. Is my understanding right? If so, it might
be better to put some assertions.

4.
@@ -10825,6 +10825,14 @@
proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
proargnames => '{tablespace,name,size,modification}',
prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for one or all relations',
+ proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
+ prorows => '20', proretset => 't', proparallel => 'r',
+ provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+ proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+ proargmodes => '{i,i,o,o,o,o,o}',
+ proargnames =>
'{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+ prosrc => 'pg_check_relation' },

Why is the pg_check_relation() is not a strict function? I think
prostrict can be 'true' for this function and we can drop checking if
the first argument is NULL.

5.
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));

I think we can do memset right before setting values to them, that is,
after checking (!found_in_sb && !force_lock).

6.
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+ uint16 *chk_found)
+{
+ PageHeader hdr = (PageHeader) buffer;
+
+ Assert(chk_expected && chk_found);
+
+ if (PageIsNew(hdr))
+ {
+ /*
+ * Check if the page is really new or if there's a corruption that
+ * affected PageIsNew detection. Note that PageIsVerified won't try to
+ * detect checksum corruption in this case, so there's no risk of
+ * duplicated corruption report.
+ */
+ if (PageIsVerified(buffer, blkno))

How about using Page instead of PageHeader? Looking at other codes,
ISTM we usually pass Page to both PageIsNew() and PageIsVerified().

7.
+ <entry>
+ <literal><function>pg_check_relation(<parameter>relation</parameter>
<type>oid</type>[, <parameter>fork</parameter>
<type>text</type>])</function></literal>.
+ </entry>

+{ oid => '9147', descr => 'check data integrity for one or all relations',
+ proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
+ prorows => '20', proretset => 't', proparallel => 'r',
+ provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+ proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+ proargmodes => '{i,i,o,o,o,o,o}',
+ proargnames =>
'{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+ prosrc => 'pg_check_relation' },

The function argument data types don't match in the doc and function
declaretion. relation is 'oid' in the doc but is 'regclass' in the
function declaretion.

8.
+#define SRF_COLS 5 /* Number of output arguments in the SRF */

Looking at similar built-in functions that return set of records they
use a more specific name for the number of returned columns such as
PG_STAT_GET_WAL_SENDERS_COLS and PG_GET_SHMEM_SIZES_COLS. How about
PG_CHECK_RELATION_COLS?

check_relation_fork() seems to quite depends on pg_check_relation()
because the returned tuplestore is specified by pg_check_relation().
It's just an idea but to improve reusability, how about moving
check_relation_fork() to checksumfunc.c? That is, in checksumfuncs.c
while iterating all blocks we call a new function in checksum.c, say
check_one_block() function, which has the following part and is
responsible for getting, checking the specified block and returning a
boolean indicating whether the block has corruption or not, along with
chk_found and chk_expected:

/*
* To avoid too much overhead, the buffer will be first read without
* the locks that would guarantee the lack of false positive, as such
* events should be quite rare.
*/
Retry:
if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
&found_in_sb))
continue;

if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
continue;

/*
* If we get a failure and the buffer wasn't found in shared buffers,
* reread the buffer with suitable lock to avoid false positive. See
* check_get_buffer for more details.
*/
if (!found_in_sb && !force_lock)
{
force_lock = true;
goto Retry;
}

A new function in checksumfuncs.c or pg_check_relation will be
responsible for storing the result to the tuplestore. That way,
check_one_block() will be useful for other use when we want to check
if the particular block has corruption with low overhead.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2020-04-03 03:28:03 Re: Some problems of recovery conflict wait events
Previous Message Yuri Astrakhan 2020-04-03 03:17:40 Re: Yet another fast GiST build