Re: Checksum errors in pg_stat_database

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: David Steele <david(at)pgmasters(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Checksum errors in pg_stat_database
Date: 2019-03-30 14:57:31
Message-ID: CAOBaU_aCWouh_A+moAKQH0+d182pn=g8mrDVCONGuJ7YcvQxFA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Mar 30, 2019 at 2:33 PM Magnus Hagander <magnus(at)hagander(dot)net> wrote:
>
> On Wed, Mar 13, 2019 at 4:54 PM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
>>
>> On Wed, Mar 13, 2019 at 4:53 PM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
>> >
>> > As a result I ended up simply adding counters for the number of total
>> > checks and the timestamp of the last failure in PgStat_StatDBEntry,
>> > making attached patch very lightweight. I moved all the checksum
>> > related counters out of pg_stat_database in a new pg_stat_checksum
>> > view. It avoids to make pg_stat_database too wide, and also allows to
>> > display information about shared object in this new view (some of the
>> > other counters don't really make sense for shared objects or could
>> > break existing monitoring query). While at it, I tried to add a
>> > little bit of documentation wrt. checksum monitoring.
>>
>> and of course I forgot to attach the patch.
>
>
> Does it really make any sense to track "number of checksum checks"? In any sort of interesting database that's just going to be an insanely high number, isn't it? (And also, to stay consistent with checksum failures, we should of course also count the checks done in base backups, which is not in the patch. But I'm more thinking we should drop it)

Thanks for looking at it!

It's surely going to be a huge number on databases with a large number
of buffer eviction and/or frequent pg_basebackup. The idea was to be
able to know if the possible lack of failure was due to lack of check
at all or because the server appears to be healthy, without spamming
gettimeofday calls. If having a last_check() is better, I'm fine with
it. If it's useless, let's drop it.

The number of checks was supposed to also be tracked in base_backups, with

@@ -1527,6 +1527,8 @@ sendFile(const char *readfilename, const char
*tarfilename, struct stat *statbuf
"failures in file \"%s\" will not "
"be reported", readfilename)));
}
+ else if (block_retry == false)
+ checksum_checks++;

> Having thought some more about this, I wonder if the right thing to do is to actually add a row to pg_stat_database for the global stats, rather than invent a separate view. I can see the argument going both ways, but particularly with the name pg_stat_checksums we are setting a pattern that will create one view for each counter. That's not very good, I think.
>
> In the end I'm somewhat split on the idea of pg_stat_database with a NULL row or pg_stat_checkpoints. What do others think?

I agree that having a separate view for each counter if a bad idea.
But what I was thinking is that we'll probably end up with a view to
track per-db online checksum activation progress/activity/status at
some point (similar to pg_stat_progress_vacuum), so why not starting
with this dedicated view right now and add new counters later, either
in pgstat and/or some shmem, as long as we keep the view name as SQL
interface.

Anyway, I don't have a strong preference for any implementation, so
I'll be happy to send an updated patch with what ends up being
preferred.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2019-03-30 15:01:50 Re: Checksum errors in pg_stat_database
Previous Message Michael Paquier 2019-03-30 14:31:09 Re: Progress reporting for pg_verify_checksums