Re: Online checksums verification in the backend

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: Online checksums verification in the backend
Date: 2020-10-23 07:28:32
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Oct 19, 2020 at 04:52:48PM +0900, Michael Paquier wrote:
> No issues with reporting the block number and the fork type in the SRF
> at least of course as this is helpful to detect the position of the
> broken blocks. For the checksum found in the header and the one
> calculated (named expected in the patch), I am less sure how to put a
> clear definition on top of that but we could always consider that
> later and extend the SRF as needed. Once the user knows that both do
> not match, he/she knows that there is a problem.

So, I have reviewed your patch set, and heavily reworked the logic to
be more consistent on many thinks, resulting in a largely simplified
patch without sacrificing its usefulness:
- The logic of the core routine of bufmgr.c is unchanged. I have
simplified it a bit though by merging the subroutines that were part
of the patch. SMgrRelation is used as argument instead of a
Relation. That's more consistent with the surroundings. The initial
read of a page without locks is still on the table as an extra
optimization, though I am not completely sure if this should be part
of CheckBuffer() or not. I also thought about the previous benchmarks
and I think that not using the most-optimized improved performance,
because it reduced the successive runes of the SQL functions, reducing
the back-pressure on the partition locks (we held on of them at the
same time for a longer period, meaning that the other 127 ran free for
a longer time). Please note that this part still referred to a
"module", which was incorrect.
- Removal of the expected and found checksums from the SRF of the
function. Based on my recent business with the page checks for base
backups, I have arrived at the conclusion that the SRF should return
data that we can absolutely trust, and the minimum I think we have to
trust here is if a given page is thought as safe or not, considering
all the sanity checks done by PageIsVerified() as the main entry
point for everything. This has led to a bit of confusion with the
addition of NoComputedChecksum for a page that was empty as of the
initial of the patch, so it happens that we don't need it anymore.
- Removal of the dependency with checksums for this feature. While
simplifying the code, I have noticed that this feature can also be
beneficial for clusters that do not have have data checksums, as
PageIsVerified() is perfectly able to run some page header checks and
the zero-page case. That's of course less useful than having the
checksums, but there is no need to add a dependency here. The file
for the SQL functions is renamed from checksumfuncs.c to pagefuncs.c.
- The function is changed to return no tuples if the relkind is not
supported, and the same applies for temporary relations. That's more
consistent with other system functions like the ones in charge of
partition information, and this makes full scans of pg_class much
easier to work with. Temporary tables were not handled correctly
anyway as these are in local buffers, but the use case of this
function in this case is not really obvious to me.
- Having the forknum in the SRF is kind of confusing, as the user
would need to map a number with the physical on-disk name. Instead, I
have made the choice to return the *path* of the corrupted file with a
block number. This way, an operator can know immediately where a
problem comes from. The patch does not append the segment number, and
I am not sure if we should actually do that, but adding it is
straight-forward as we have the block number. There is a dependency
with table AMs here as well, as this goes down to fd.c, explaining why
I have not added it and just.
- I really don't know what kind of default ACL should apply for such
functions, but I am sure that SCAN_TABLES is not what we are looking
for here, and there is nothing preventing us from having a safe
default from the start of times, so I moved the function to be
superuser-only by default, and GRANT can be used to allow its
execution to other roles. We could relax that in the future, of
course, this can be discussed separately.
- The WARNING report for each block found as corrupted gains an error
context, as a result of a switch to PageIsVerified(), giving a user
all the information needed in the logs on top of the result in the
SRF. That's useful as well if combined with CHECK_FOR_INTERRUPTS(),
and I got to wonder if we should have some progress report for this
stuff, though that's a separate discussion.
- The function is renamed to something less generic,
pg_relation_check_pages(), and I have reduced the number of functions
from two to one, where the user can specify the fork name with a new
option. The default of NULL means that all the forks of a relation
are checked.
- The TAP tests are rather bulky. I have moved all the non-corruption
test cases into a new SQL test file. That's useful for people willing
to do some basic sanity checks with a non-default table AM. At least
it provides a minimum coverage. I have not completely finished its
review, but I have done some work. Doing some debugging of
corrupt_and_test_block() was proving to be rather difficult as the
same test names are assigned multiple times. I am tempted to move
this test suite to src/test/recovery/ instead.
- Reworked the docs and some comments.

That's quite a lot of changes, and I think that most of the C code,
the main tests in src/test/regress/ and the docs are getting in a
rather committable state. The TAP tests emulating corruptions still
need a closer lookup (say, check_pg_stat_database_nb_error() should
have an error prefix at least). The portions in bufmgr.c and the rest
should of course be split into two separate commits, that can easily
be done. And the code needs an indentation run and a catalog bump.

Attachment Content-Type Size
online-checksum-v18.patch text/x-diff 30.2 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2020-10-23 07:37:18 Re: Tab complete for alter table rls
Previous Message Heikki Linnakangas 2020-10-23 07:04:31 Re: partition routing layering in nodeModifyTable.c