Re: Online checksums verification in the backend

From: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: Online checksums verification in the backend
Date: 2020-03-28 03:28:27
Message-ID: CA+fd4k4B8QVxhQa8=5pf7SuDcB0vBhqzOn0c59FxVa99CF0ptQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
>
> On Wed, Mar 18, 2020 at 07:06:19AM +0100, Julien Rouhaud wrote:
> > On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote:
> > > On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:
> > > > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:
> > > >> With a large amount of
> > > >> shared buffer eviction you actually increase the risk of torn page
> > > >> reads. Instead of a logic relying on partition mapping locks, which
> > > >> could be unwise on performance grounds, did you consider different
> > > >> approaches? For example a kind of pre-emptive lock on the page in
> > > >> storage to prevent any shared buffer operation to happen while the
> > > >> block is read from storage, that would act like a barrier.
> > > >
> > > > Even with a workload having a large shared_buffers eviction pattern, I don't
> > > > think that there's a high probability of hitting a torn page. Unless I'm
> > > > mistaken it can only happen if all those steps happen concurrently to doing the
> > > > block read just after releasing the LWLock:
> > > >
> > > > - postgres read the same block in shared_buffers (including all the locking)
> > > > - dirties it
> > > > - writes part of the page
> > > >
> > > > It's certainly possible, but it seems so unlikely that the optimistic lock-less
> > > > approach seems like a very good tradeoff.
> > >
> > > Having false reports in this area could be very confusing for the
> > > user. That's for example possible now with checksum verification and
> > > base backups.
> >
> >
> > I agree, however this shouldn't be the case here, as the block will be
> > rechecked while holding proper lock the 2nd time in case of possible false
> > positive before being reported as corrupted. So the only downside is to check
> > twice a corrupted block that's not found in shared buffers (or concurrently
> > loaded/modified/half flushed). As the number of corrupted or concurrently
> > loaded/modified/half flushed blocks should usually be close to zero, it seems
> > worthwhile to have a lockless check first for performance reason.
>
>
> I just noticed some dumb mistakes while adding the new GUCs. v5 attached to
> fix that, no other changes.

Thank you for updating the patch. I have some comments:

1.
+ <entry>
+ <literal><function>pg_check_relation(<parameter>relation</parameter>
<type>oid</type>, <parameter>fork</parameter>
<type>text</type>)</function></literal>
+ </entry>

Looking at the declaration of pg_check_relation, 'relation' and 'fork'
are optional arguments. So I think the above is not correct. But as I
commented below, 'relation' should not be optional, so maybe the above
line could be:

+ <literal><function>pg_check_relation(<parameter>relation</parameter>
<type>oid</type>[, <parameter>fork</parameter>
<type>text</type>])</function></literal>

2.
+ <indexterm>
+ <primary>pg_check_relation</primary>
+ </indexterm>
+ <para>
+ <function>pg_check_relation</function> iterates over all the blocks of all
+ or the specified fork of a given relation and verify their checksum. It
+ returns the list of blocks for which the found checksum doesn't match the
+ expected one. You must be a member of the
+ <literal>pg_read_all_stats</literal> role to use this function. It can
+ only be used if data checksums are enabled. See <xref
+ linkend="app-initdb-data-checksums"/> for more information.
+ </para>

* I think we need a description about possible values for 'fork'
(i.g., 'main', 'vm', 'fsm' and 'init'), and the behavior when 'fork'
is omitted.

* Do we need to explain about checksum cost-based delay here?

3.
+CREATE OR REPLACE FUNCTION pg_check_relation(
+ IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text,
+ OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+ OUT expected_checksum integer, OUT found_checksum integer)
+ RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation'
+ PARALLEL RESTRICTED;

Now that pg_check_relation doesn't accept NULL as 'relation', I think
we need to make 'relation' a mandatory argument.

4.
+ /* Check if the relation (still) exists */
+ if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+ {
+ /*
+ * We use a standard relation_open() to acquire the initial lock. It
+ * means that this will block until the lock is acquired, or will
+ * raise an ERROR if lock_timeout has been set. If caller wants to
+ * check multiple tables while relying on a maximum wait time, it
+ * should process tables one by one instead of relying on a global
+ * processing with the main SRF.
+ */
+ relation = relation_open(relid, AccessShareLock);
+ }

IIUC the above was necessary because we used to have
check_all_relations() which iterates all relations on the database to
do checksum checks. But now that we don't have that function and
pg_check_relation processes one relation. Can we just do
relation_open() here?

5.
I think we need to check if the relation is a temp relation. I'm not
sure it's worth to check checksums of temp relations but at least we
need not to check other's temp relations.

6.
+/*
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issue. Note that if a buffer is found dirty in shared_buffers, no read will
+ * be performed and the caller will be informed that no check should be done.
+ * We can safely ignore such buffers as they'll be written before next
+ * checkpoint's completion..
+ *
+ * The following locks can be used in this function:
+ *
+ * - shared LWLock on the target buffer pool partition mapping.
+ * - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless caller asked otherwise, setting needlock. In this case, the
+ * read will be done while the buffer mapping partition LWLock is still being
+ * held. Reading with this lock is to avoid the unlikely but possible case
+ * that a buffer wasn't present in shared buffers when we checked but it then
+ * alloc'ed in shared_buffers, modified and flushed concurrently when we
+ * later try to read it, leading to false positive due to torn page. Caller
+ * can read first the buffer without holding the target buffer mapping
+ * partition LWLock to have an optimistic approach, and reread the buffer
+ * from disk in case of error.
+ *
+ * Caller should hold an AccessShareLock on the Relation
+ */

I think the above comment also needs some "/*-------" at the beginning and end.

7.
+static void
+check_get_buffer(Relation relation, ForkNumber forknum,
+ BlockNumber blkno, char *buffer, bool needlock, bool *checkit,
+ bool *found_in_sb)
+{

Maybe we can make check_get_buffer() return a bool indicating we found
a buffer to check, instead of having '*checkit'. That way, we can
simplify the following code:

+ check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+ &checkit, &found_in_sb);
+
+ if (!checkit)
+ continue;

to something like:

+ if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+ &found_in_sb))
+ continue;

8.
+ if (PageIsVerified(buffer, blkno))
+ {
+ /*
+ * If the page is really new, there won't by any checksum to be
+ * computed or expected.
+ */
+ *chk_expected = *chk_found = NoComputedChecksum;
+ return true;
+ }
+ else
+ {
+ /*
+ * There's a corruption, but since this affect PageIsNew, we
+ * can't compute a checksum, so set NoComputedChecksum for the
+ * expected checksum.
+ */
+ *chk_expected = NoComputedChecksum;
+ *chk_found = hdr->pd_checksum;
+ }
+ return false;

* I think the 'else' is not necessary here.

* Setting *chk_expected and *chk_found seems useless when we return
true. The caller doesn't use them.

* Should we forcibly overwrite ignore_checksum_failure to off in
pg_check_relation()? Otherwise, this logic seems not work fine.

* I don't understand why we cannot compute a checksum in case where a
page looks like a new page but is actually corrupted. Could you please
elaborate on that?

8.
+ {
+ {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+ gettext_noop("Checksum cost for a page found in the buffer cache."),
+ NULL
+ },
+ &ChecksumCostPageHit,
+ 1, 0, 10000,
+ NULL, NULL, NULL
+ },

* There is no description about the newly added four GUC parameters in the doc.

* We need to put new GUC parameters into postgresql.conf.sample as well.

* The patch doesn't use checksum_cost_page_hit at all.

9.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..37f63e747c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -134,6 +134,14 @@ int max_worker_processes = 8;
int max_parallel_workers = 8;
int MaxBackends = 0;

+int ChecksumCostPageHit = 1; /* GUC parameters for
checksum check */
+int ChecksumCostPageMiss = 10;
+int ChecksumCostLimit = 200;
+double ChecksumCostDelay = 0;
+
+int ChecksumCostBalance = 0; /* working state for
checksums check */
+bool ChecksumCostActive = false;

Can we declare them in checksum.c since these parameters are used only
in checksum.c and it does I/O my itself.

10.
+ /* Report the failure to the stat collector and the logs. */
+ pgstat_report_checksum_failure();
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blkno,
+ relpath(relation->rd_smgr->smgr_rnode, forknum))));

I think we could do pgstat_report_checksum_failure() and emit WARNING
twice for the same page since PageIsVerified() also does the same.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Guancheng Luo 2020-03-28 03:28:41 Re: [PATCH] Check operator when creating unique index on partition table
Previous Message Peter Geoghegan 2020-03-28 03:17:54 Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM