Re: Checksum errors in pg_stat_database

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Julien Rouhaud <rjuju123(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Checksum errors in pg_stat_database
Date: 2019-04-02 17:06:35
Message-ID: CABUevEzorGZ-wNzbAhzQj3B+byo20BCWrH-S+tBh+sQykVj2iQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 2, 2019 at 8:47 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:

> On Tue, Apr 02, 2019 at 07:43:12AM +0200, Julien Rouhaud wrote:
> > On Tue, Apr 2, 2019 at 6:56 AM Michael Paquier <michael(at)paquier(dot)xyz>
> wrote:
> >> One thing which is not
> >> proposed on this patch, and I am fine with it as a first draft, is
> >> that we don't have any information about the broken block number and
> >> the file involved. My gut tells me that we'd want a separate view,
> >> like pg_stat_checksums_details with one tuple per (dboid, rel, fork,
> >> blck) to be complete. But that's just for future work.
> >
> > That could indeed be nice.
>
> Actually, backpedaling on this one... pg_stat_checksums_details may
> be a bad idea as we could finish with one row per broken block. If
> a corruption is spreading quickly, pgstat would not be able to sustain
> that amount of objects. Having pg_stat_checksums would allow us to
> plugin more data easily based on the last failure state:
> - last relid of failure
> - last fork type of failure
> - last block number of failure.
> Not saying to do that now, but having that in pg_stat_database does
> not seem very natural to me. And on top of that we would have an
> extra row full of NULLs for shared objects in pg_stat_database if we
> adopt the unique view approach... I find that rather ugly.
>

I think that tracking each and every block is of course a non-starter, as
you've noticed.

I'm really not sure how much those three extra fields help, TBH. As I see
it the real usecase for this is automated monitoring and quick-checks of
the kind of "is my db currently broken somewhere", in combination with "did
this occur recently" (for people who have never looked at their stats).

This gives people enough information to know where to go look in the logs.

I mean, what's the actual usecase for tracking relid/fork/block of the
*last* failure only? To monitor and see if it changes? What do I do when I
have 10 failures, and I only know about the last one? (I have to go to the
logs anyway)

I think having the count and hte last time make sense, but I'm very
sceptical about the rest.

I can somewhat agree that splitting it on a per database level might even
at that be overdoing it. What might actually be more interesting from a
failure-location perspective would be tablespace, rather than any of the
others. Or we could reduce it down to just putting it in pg_stat_bgwriter
and only count global values perhaps, if in the end we don't think the
split-per-database is reasonable?

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/>
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2019-04-02 17:41:49 Re: COPY FROM WHEN condition
Previous Message Tom Lane 2019-04-02 17:03:39 Re: Inadequate executor locking of indexes