Re: Print physical file path when checksum check fails

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: hzhang(at)pivotal(dot)io
Cc: andres(at)anarazel(dot)de, thomas(dot)munro(at)gmail(dot)com, sdn(at)amazon(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Print physical file path when checksum check fails
Date: 2020-02-19 04:07:36
Message-ID: 20200219.130736.1644339723387641867.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello. Thank you for the new patch.

At Tue, 18 Feb 2020 09:27:39 +0800, Hubert Zhang <hzhang(at)pivotal(dot)io> wrote in
> On Wed, Feb 12, 2020 at 5:22 PM Hubert Zhang <hzhang(at)pivotal(dot)io> wrote:
>
> > Thanks Andres,
> >
> > On Tue, Feb 11, 2020 at 5:30 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> >> HHi,
> >>
> >> On 2020-02-10 16:04:21 +0800, Hubert Zhang wrote:
> >> > Currently we only print block number and relation path when checksum
> >> check
> >> > fails. See example below:
> >> >
> >> > ERROR: invalid page in block 333571 of relation base/65959/656195
> >>
> >> > DBA complains that she needs additional work to calculate which physical
> >> > file is broken, since one physical file can only contain `RELSEG_SIZE`
> >> > number of blocks. For large tables, we need to use many physical files
> >> with
> >> > additional suffix, e.g. 656195.1, 656195.2 ...
> >> >
> >> > Is that a good idea to also print the physical file path in error
> >> message?
> >> > Like below:
> >> >
> >> > ERROR: invalid page in block 333571 of relation base/65959/656195, file
> >> > path base/65959/656195.2
> >>
> >> I think that'd be a nice improvement. But:
> >>
> >> I don't think the way you did it is right architecturally. The
> >> segmenting is really something that lives within md.c, and we shouldn't
> >> further expose it outside of that. And e.g. the undo patchset uses files
> >> with different segmentation - but still goes through bufmgr.c.
> >>
> >> I wonder if this partially signals that the checksum verification piece
> >> is architecturally done in the wrong place currently? It's imo not good
> >> that every place doing an smgrread() needs to separately verify
> >> checksums. OTOH, it doesn't really belong inside smgr.c.
> >>
> >>
> >> This layering issue was also encountered in
> >>
> >> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3eb77eba5a51780d5cf52cd66a9844cd4d26feb0
> >> so perhaps we should work to reuse the FileTag it introduces to
> >> represent segments, without hardcoding the specific segment size?
> >>
> >>
> > I checked the FileTag commit. It calls `register_xxx_segment` inside md.c
> > to store the sync request into a hashtable and used by checkpointer later.
> >
> > Checksum verify is simpler. We could move the `PageIsVerified` into md.c
> > (mdread). But we can not elog error inside md.c because read buffer mode
> > RBM_ZERO_ON_ERROR is at bugmgr.c level.
> >
> > One idea is to change save the error message(or the FileTag) at (e.g. a
> > static string in bufmgr.c) by calling `register_checksum_failure` inside
> > mdread in md.c.
> >
> > As for your concern about the need to do checksum verify after every
> > smgrread, we now move the checksum verify logic into md.c, but we still
> > need to check the checksum verify result after smgrread and reset buffer to
> > zero if mode is RBM_ZERO_ON_ERROR.
> >
> > If this idea is OK, I will submit the new PR.
> >
> >
> Based on Andres's comments, here is the new patch for moving checksum
> verify logic into mdread() instead of call PageIsVerified in every
> smgrread(). Also using FileTag to print the physical file name when
> checksum verify failed, which handle segmenting inside md.c as well.

The patch doesn't address the first comment from Andres. It still
expose the notion of segment to the upper layer, but bufmgr (bufmgr.c
and bufpage.c) or Realation (relpath.c) layer shouldn't know of
segment. So the two layers should ask smgr without using segment
number for the real file name for the block.

For example, I think the following structure works. (Without moving
checksum verification.)

======
md.c:
char *mdfname(SmgrRelation reln, Forknumber forkNum, BlockNumber blocknum);
smgr.c:
char *smgrfname(SMgrRelation reln, ForkNumber forkNum, BlockNumber Blocknum);

bufmgr.c:
ReadBuffer_common()
{
..
/* check for garbage data */
if (!PageIsVerified((Page) bufBlock, blockNum))
if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
ereport(WARNING,
errmsg("invalid page in block %u in file %s; zeroing out page",
blockNum,
smgrfname(smgr, forkNum, blockNum))));
MemSet((char *) bufBlock, 0, BLCKSZ);
====

However, the block number in the error messages looks odd as it is
actually not the block number in the segment. We could convert
BlockNum into relative block number or offset in the file but it would
be overkill. So something like this works?

"invalid page in block %u in relation %u, file \"%s\"",
blockNum, smgr->smgr_rnode.node.relNode, smgrfname()

If we also verify checksum in md layer, callback is overkill since the
immediate caller consumes the event immediately. We can signal the
error by somehow returning a file tag.

======
md.c:
FileTag *mdread(...) /* or void mdread(..., FileTag*)? */
{
if (!VerfyPage())
return ftag;
....
return NULL;
}
char *mdfname(FileTag *ftag);
smgr.c:
FileTag *smgrread(...);
char *smgrfname(FileTag *ftag);

bufmgr.c:
ReadBuffer_common()
{
FileTag *errtag;
..
errftag = smgrread();

if (errtag))
/* page verification failed */
if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
ereport(WARNING,
errmsg("invalid page in block %u in file %s; zeroing out page",
blockNum, smgrfname(errftag)));
MemSet((char *) bufBlock, 0, BLCKSZ);
====

But it is uneasy that smgrread just returning a filetag signals
checksum error..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2020-02-19 04:08:15 Re: Parallel copy
Previous Message Amit Kapila 2020-02-19 03:56:32 Re: ERROR: subtransaction logged without previous top-level txn record