Re: Checksum errors in pg_stat_database

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: David Steele <david(at)pgmasters(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Checksum errors in pg_stat_database
Date: 2019-03-10 12:13:50
Message-ID: CAOBaU_bxJTyOdR2Up_yjoCs06wmF=e9YJKMx_DR=1P+h840Bqw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Mar 9, 2019 at 7:58 PM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
>
> On Sat, Mar 9, 2019 at 7:50 PM Magnus Hagander <magnus(at)hagander(dot)net> wrote:
> >
> > On Sat, Mar 9, 2019 at 10:41 AM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
> >>
> >> Sorry, I have again new comments after a little bit more thinking.
> >> I'm wondering if we can do something about shared objects while we're
> >> at it. They don't belong to any database, so it's a little bit
> >> orthogonal to this proposal, but it seems quite important to track
> >> error on those too!
> >>
> >> What about adding a new field in PgStat_GlobalStats for that? We can
> >> use the same lastDir to easily detect such objects and slightly adapt
> >> sendFile again, which seems quite straightforward.
>
> > Question is then what number that should show -- only the checksum counter in non-database-fields, or the total number across the cluster?
>
> I'd say only for non-database-fields errors, especially if we can
> reset each counters separately. If necessary, we can add a new view
> to give a global overview of checksum errors for DBA convenience.

So, after reading current implementation, I don't think that
PgStat_GlobalStats is the right place. It's already enough of a mess,
and clearly pg_stat_reset_shared('bgwriter') would not make any sense
if it did reset the shared relations checksum errors (though arguably
the fact that's it's resetting checkpointer stats right now is hardly
better), and handling a different target to reset part of
PgStat_GlobalStats counters would be an ugly kludge.

I'm considering adding a new PgStat_ChecksumStats for that purpose
instead, but I don't know if that's acceptable to do so in the last
commitfest. It seems worthwhile to add it eventually, since we'll
probably end up having more things to report to users related to
checksum. Online enabling of checksum could be the most immediate
potential target.

If that's acceptable, I was thinking this new stat could have those
fields with the first drop:

- number of non-db-related checksum checks done
- number of non-db-related checksum checks failed
- last stats reset

(and adding the number of checks for db-related blocks done with the
current checksum errors counter). Maybe also adding a
pg_checksum_stats view that would summarise all the counters in one
place.

It'll obviously add some traffic to the stats collector, but I'd hope
not too much, since BufferAlloc shouldn't be that frequent, and
BASE_BACKUP reports stats only once per file.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dean Rasheed 2019-03-10 13:09:13 Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Previous Message Fabien COELHO 2019-03-10 11:26:30 Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)