Re: Write Ahead Logging for Hash Indexes

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Write Ahead Logging for Hash Indexes
Date: 2016-09-22 03:21:13
Message-ID: CAMkU=1z=NzD5XC+q1+qanzTdJx5i7vZkji36rRPYMU=mGcgibQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 20, 2016 at 10:27 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> > On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > wrote:
> >>
> >>
> >> Okay, thanks for pointing that out. I have fixed it. Apart from
> >> that, I have changed _hash_alloc_buckets() to initialize the page
> >> instead of making it completely zero because of the problems discussed
> >> in another related thread [1]. I have also updated the README.
> >>
> >
> > With v7 of the concurrent hash patch, v4 of the write-ahead log patch,
> > and the latest relcache patch (I don't know how important that is to
> > reproducing this; I suspect it is not), I once got this error:
> >
> >
> > 38422 00000 2016-09-19 16:25:50.055 PDT:LOG: database system was
> > interrupted; last known up at 2016-09-19 16:25:49 PDT
> > 38422 00000 2016-09-19 16:25:50.057 PDT:LOG: database system was not
> > properly shut down; automatic recovery in progress
> > 38422 00000 2016-09-19 16:25:50.057 PDT:LOG: redo starts at 3F/2200DE90
> > 38422 01000 2016-09-19 16:25:50.061 PDT:WARNING: page verification failed,
> > calculated checksum 65067 but expected 21260
> > 38422 01000 2016-09-19 16:25:50.061 PDT:CONTEXT: xlog redo at 3F/22053B50
> > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
> > 38422 XX001 2016-09-19 16:25:50.071 PDT:FATAL: invalid page in block 9 of
> > relation base/16384/17334
> > 38422 XX001 2016-09-19 16:25:50.071 PDT:CONTEXT: xlog redo at 3F/22053B50
> > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
> >
> >
> > The original page with the invalid checksum is:
> >
>
> I think this is an example of the torn page problem, which seems to be
> happening because of the code below in your test.
>
> !     if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress()) {
> !         nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
> !         ereport(FATAL,
> !                 (errcode(ERRCODE_DISK_FULL),
> !                  errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
> !                         blocknum,
> !                         relpath(reln->smgr_rnode, forknum),
> !                         nbytes, BLCKSZ),
> !                  errhint("JJ is screwing with the database.")));
> !     } else {
> !         nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
> !     }
>
> If you are running the above test with JJ_torn_page disabled, then it
> is a different matter and we need to investigate it, but I assume you
> are running with it enabled.
>
> I think this could happen if the actual change to the page is in the
> 2/3 of the page that the above code does not write. The checksum in
> the page header, written out as part of the partial page write (the
> first 1/3 of the page), would reflect the change you made, whereas
> after the restart, when the page is read again to apply redo, the
> checksum calculation won't cover the change made in the unwritten 2/3
> of the page.
>
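
In code terms, the scenario being described is roughly the following (an
illustration only, not code from the patch or from my harness; the helper
name is made up, and I'm assuming pg_checksum_page() from
storage/checksum.h):

    #include "postgres.h"
    #include "storage/bufpage.h"
    #include "storage/checksum.h"

    /*
     * Why a 1/3-page write can produce "page verification failed": the
     * stored checksum describes the whole modified page, but only the
     * first third of that page ever reaches disk.
     */
    static void
    torn_write_breaks_checksum(char *new_page, char *old_page, BlockNumber blkno)
    {
        static char torn[BLCKSZ];

        /* the checksum stored in the header covers all BLCKSZ bytes ... */
        ((PageHeader) new_page)->pd_checksum = pg_checksum_page(new_page, blkno);

        /* ... but only the first third of the page reaches disk */
        memcpy(torn, new_page, BLCKSZ / 3);
        memcpy(torn + BLCKSZ / 3, old_page + BLCKSZ / 3, BLCKSZ - BLCKSZ / 3);

        /*
         * At redo time the page is read back as this old/new mixture; if
         * the WAL-logged change sat in the unwritten two-thirds, the
         * recomputed checksum no longer matches the stored one.
         */
        if (pg_checksum_page(torn, blkno) != ((PageHeader) torn)->pd_checksum)
            elog(WARNING, "page verification would fail here");
    }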

Correct. But any torn page write should be covered by the restoration of a
full page image during replay, shouldn't it? And that restoration should
happen blindly, without first reading in the old page and verifying its
checksum. Failure to restore the page from an FPI would be a bug. (That
was the purpose for which I wrote this testing harness in the first place,
to verify that the restoration of FPIs happens correctly; although most of
the bugs it happens to uncover have been unrelated to that.)
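
For reference, this is the shape I'd expect the redo side to take (a sketch
only, using the generic XLogReadBufferForRedo() pattern; the function name
is made up and the real routines are whatever the patch installs):

    #include "postgres.h"
    #include "access/xlogreader.h"
    #include "access/xlogutils.h"
    #include "storage/bufmgr.h"

    /*
     * If the WAL record carries a full page image for this block,
     * XLogReadBufferForRedo() returns BLK_RESTORED after overwriting the
     * buffer from the image, without reading or checksum-verifying
     * whatever (possibly torn) page happens to be on disk.
     */
    static void
    redo_one_block_sketch(XLogReaderState *record, uint8 block_id)
    {
        Buffer      buf;

        if (XLogReadBufferForRedo(record, block_id, &buf) == BLK_NEEDS_REDO)
        {
            Page        page = BufferGetPage(buf);

            /* no usable FPI: re-apply the logged change to the existing page */
            /* ... block-specific changes would go here ... */

            PageSetLSN(page, record->EndRecPtr);
            MarkBufferDirty(buf);
        }
        /* BLK_RESTORED or BLK_DONE: the page is already correct */

        if (BufferIsValid(buf))
            UnlockReleaseBuffer(buf);
    }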

>
> Today, Ashutosh has shared the logs of his test run where he has shown
> similar problem for HEAP page. I think this could happen though
> rarely for any page with the above kind of tests.
>

I think Ashutosh's examples are of warnings, not errors. I think the
warnings occur when replay needs to read in the block (for reasons I don't
understand yet) but then doesn't care whether it passes the checksum or not,
because the page will just be blown away by the replay anyway.

> Does this explanation explains the reason of problem you are seeing?
>

If it can't survive artificial torn page writes, then it probably can't
survive real ones either. So I am pretty sure it is a bug of some sort.
Perhaps the bug is that it is generating an ERROR when it should just be a
WARNING?

>
> >
> > If I ignore the checksum failure and re-start the system, the page gets
> > restored to be a bitmap page.
> >
>
> Okay, but have you ensured in some way that redo is applied to the
> bitmap page?
>

I haven't done that yet. I can't start the system without destroying the
evidence, and I haven't figured out yet how to import a specific block from
a shut-down server into a bytea of a running server, in order to inspect it
using pageinspect.

> Today, while thinking about this problem, I realized that currently in
> the patch we are using REGBUF_NO_IMAGE for the bitmap page because of
> one of the problems you reported [1]. That change will fix the problem
> you reported, but it exposes bitmap pages to torn-page hazards. I
> think the right fix there is to make pd_lower equal to pd_upper for
> the bitmap page, so that full page writes don't exclude the data in
> the bitmap page.
>

I'm afraid that is over my head. I can study it until it makes sense, but
it will take me a while.
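
If I'm reading the suggestion right, it is something like the following (a
sketch under my own assumptions, not code from the patch, and I may well
have this wrong; the function name is made up): full page images taken for
buffers registered with REGBUF_STANDARD omit the "hole" between pd_lower and
pd_upper, and a hash bitmap page keeps its bitmap words right after the page
header, inside that hole, so they would be left out of the image. Raising
pd_lower removes the hole, the FPI then covers the bitmap data, and the page
would no longer need REGBUF_NO_IMAGE.

    #include "postgres.h"
    #include "storage/bufpage.h"

    /*
     * Sketch: leave no hole between pd_lower and pd_upper on a bitmap
     * page, so that a full page image taken with REGBUF_STANDARD includes
     * the bitmap words instead of skipping them.
     */
    static void
    hash_bitmap_page_close_hole(Page page)
    {
        PageHeader  phdr = (PageHeader) page;

        phdr->pd_lower = phdr->pd_upper;
    }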

Cheers,

Jeff

> [1] - https://www.postgresql.org/message-id/CAA4eK1KJOfVvFUmi6dcX9Y2-0PFHkomDzGuyoC%3DaD3Qj9WPpFA%40mail.gmail.com
>
>
