Re: Online checksums verification in the backend

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: Online checksums verification in the backend
Date: 2020-09-16 03:45:58
Message-ID: 20200916034558.GB11110@paquier.xyz
Lists: pgsql-hackers

On Fri, Sep 11, 2020 at 09:49:16AM +0200, Julien Rouhaud wrote:
> Thanks!

I got some numbers out of my pocket, using the following base
configuration:
wal_level = minimal
fsync = off
shared_buffers = '300MB' # also tested with 30MB and 60MB
checksum_cost_delay = 0 # default in patch

For the test I have used pgbench initialized at a scale of 250,
giving close to 3.5GB of data, so roughly 90% of the pages cannot
fit in shared buffers while all of the data remains cached by the
OS (we may also want to look at the case where the OS cache does
not hold all the relation pages). Runs with 30MB and 60MB of
shared buffers showed similar results.

I also applied some prewarming on the cluster:
create extension pg_prewarm;
select pg_prewarm(oid) from pg_class where oid > 16000;

Then, I did five runs of:
pgbench -S -M prepared -c 64/128/256 -n -T 60
while keeping the following running the whole time:
select pg_check_relation('pgbench_accounts');
\watch 0.1

Here are some TPS numbers with the execution time of
pg_check_relation. Of the five runs, I removed the highest and the
lowest, then took the average of the remaining three. I have also
tested two modes, with and without the optimization, which requires
a one-liner in checksum.c as per your latest patch:
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -84,7 +84,7 @@ check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno,
uint16 *chk_expected, uint16 *chk_found)
{
char buffer[BLCKSZ];
- bool force_lock = false;
+ bool force_lock = true;
bool found_in_sb;

Within parentheses is the time taken by pg_check_relation() for a
single check. This is not completely exact and I saw some
variation, so it is only meant to give an impression:
Conns 64 128 256
force_lock=true 60590 (7~8s) 55652 (10.2~10.5s) 46812 (9~10s)
force_lock=false 58637 (5s) 54131 (6~7s) 37091 (1.1~1.2s)

For 128 connections and higher, I was kind of surprised to see
pg_check_relation being more aggressive without the optimization,
with much less contention on the buffer mapping LWLock, though
that is only an impression from looking at pg_stat_activity.

Looking at the wait events for each run, I saw many more hiccups on
the buffer mapping LWLock when forcing the lock than when not,
though I could still see some contention even without forcing it.
Not surprising, as this evicts a bunch of pages from shared
buffers.

> I used all default settings, but by default checksum_cost_delay is 0
> so there shouldn't be any throttling.

Thanks, so did I. From what I can see, there could also be a
benefit in not using the optimization by default, as the relation
check then applies some natural throttling by making the checks
slower (there is a link between the individual runtime of
pg_check_relation and the TPS). So the throttling parameters seem
beneficial, but there is also a point in not including any
throttling in a first version if the optimization to go full speed
is not around. Three new GUCs for those function calls is still
too much IMO, so there is also an argument for moving all this
stuff into a new contrib/ module, with a bgworker implementation
as part of it, as that would need shared_preload_libraries anyway.

Also, I have been putting some thought into the APIs able to fetch
a buffer without going through the shared buffers. Neither
checksum.c nor checksumfuncs.c looks like the right place for
them: checksum.c, because it should be a place those APIs depend
on, including only the most internal logic of the checksum
algorithm and its computation; checksumfuncs.c, because we need to
think as well about the case of a background worker (one that
could spawn a set of dynamic workers connecting to different
databases to do checksum verification?). It would be good to keep
the handling of the buffer mapping lock as well as the calls to
smgrread() in a single place. ReadBuffer_common() is a natural
place for that, though for our use case it means the addition of
three new options:
- Being able to pass down directly a buffer pointer to save the page
contents.
- Being able to not verify directly a page, leaving the verification
to the caller upthread.
- Addition of a new mode, that I am calling here RBM_PRIVATE, where
we actually read the page from disk if it is not yet in shared
buffers, except that we fill in the contents of the page using the
pointer given by the caller. That's different from using local
buffers, as we don't need that much complication, like temporary
tables, for per-page checks.
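To make the idea concrete, here is a rough sketch of what the
extended interface could look like. This is only illustrative
pseudocode against the backend headers; the extra parameters and
the RBM_PRIVATE name are assumptions of mine, not a proposal for
the final API:

```c
/* Hypothetical extension of the existing ReadBufferMode enum in
 * bufmgr.h -- only RBM_PRIVATE is new here, all other values exist
 * today. */
typedef enum ReadBufferMode
{
	RBM_NORMAL,
	RBM_ZERO_AND_LOCK,
	RBM_ZERO_AND_CLEANUP_LOCK,
	RBM_ZERO_ON_ERROR,
	RBM_NORMAL_NO_LOG,
	RBM_PRIVATE		/* fill a caller-supplied buffer, no eviction */
} ReadBufferMode;

/*
 * Hypothetical signature: with RBM_PRIVATE, "private_page" receives
 * the page contents, copied from shared buffers under the buffer
 * mapping lock if the page is already there, or read via smgrread()
 * otherwise.  With "skip_verify", page verification is left to the
 * caller.  The real ReadBufferExtended() has neither parameter.
 */
extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
								 BlockNumber blockNum,
								 ReadBufferMode mode,
								 BufferAccessStrategy strategy,
								 char *private_page,
								 bool skip_verify);
```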

Another idea would be to let ReadBuffer_common do the check by
itself, with a different mode, a kind of RBM_VALIDATE, where we
just return a verification state of the page that callers can
consume.
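As a rough illustration of how a caller could consume such a mode,
again with hypothetical names (ReadBufferValidate and the state
enum do not exist, this is only a sketch of the shape):

```c
/* Hypothetical caller-side loop for a RBM_VALIDATE-style mode. */
for (blkno = 0; blkno < nblocks; blkno++)
{
	PageVerificationState state;

	/* Would return the page's verification state instead of
	 * raising an error on a checksum mismatch. */
	state = ReadBufferValidate(relation, MAIN_FORKNUM, blkno);
	if (state == PAGE_CHECKSUM_MISMATCH)
		report_corruption(relation, MAIN_FORKNUM, blkno);
}
```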

This also comes with some more advantages:
- Tracking of reads from disk with track_io_timing.
- Addition of some private stats dedicated to this private mode,
with new fields in pgBufferUsage, all in a single place.
--
Michael
