Re: Write Ahead Logging for Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Write Ahead Logging for Hash Indexes
Date: 2016-09-22 04:46:48
Message-ID: CAA4eK1+k9tGPw7vOACRo+4h5n=utHnSuGgMoJrLANybAjNBW9w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 22, 2016 at 8:51 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Tue, Sep 20, 2016 at 10:27 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
>>
>> On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> > On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>> > wrote:
>> >>
>> >>
>> >> Okay, thanks for pointing that out. I have fixed it. Apart from
>> >> that, I have changed _hash_alloc_buckets() to initialize the page
>> >> instead of zeroing it completely, because of problems discussed in
>> >> another related thread [1]. I have also updated the README.
>> >>
>> >
>> > With v7 of the concurrent hash patch, v4 of the write ahead log
>> > patch, and the latest relcache patch (I don't know how important
>> > that is to reproducing this; I suspect it is not), I once got this
>> > error:
>> >
>> >
>> > 38422 00000 2016-09-19 16:25:50.055 PDT:LOG: database system was interrupted; last known up at 2016-09-19 16:25:49 PDT
>> > 38422 00000 2016-09-19 16:25:50.057 PDT:LOG: database system was not properly shut down; automatic recovery in progress
>> > 38422 00000 2016-09-19 16:25:50.057 PDT:LOG: redo starts at 3F/2200DE90
>> > 38422 01000 2016-09-19 16:25:50.061 PDT:WARNING: page verification failed, calculated checksum 65067 but expected 21260
>> > 38422 01000 2016-09-19 16:25:50.061 PDT:CONTEXT: xlog redo at 3F/22053B50 for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>> > 38422 XX001 2016-09-19 16:25:50.071 PDT:FATAL: invalid page in block 9 of relation base/16384/17334
>> > 38422 XX001 2016-09-19 16:25:50.071 PDT:CONTEXT: xlog redo at 3F/22053B50 for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>> >
>> >
>> > The original page with the invalid checksum is:
>> >
>>
>> I think this is an example of the torn page problem, which seems to
>> be happening because of the below code in your test.
>>
>> ! if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress()) {
>> !     nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
>> !     ereport(FATAL,
>> !             (errcode(ERRCODE_DISK_FULL),
>> !              errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
>> !                     blocknum,
>> !                     relpath(reln->smgr_rnode, forknum),
>> !                     nbytes, BLCKSZ),
>> !              errhint("JJ is screwing with the database.")));
>> ! } else {
>> !     nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
>> ! }
>>
>> If you are running the above test with JJ_torn_page disabled, then it
>> is a different matter and we need to investigate it, but I assume you
>> are running with it enabled.
>>
>> I think this could happen if the actual change to the page is in the
>> last 2/3 of the page, which the above code does not write. The
>> checksum in the page header, written as part of the partial page
>> write (the first 1/3 of the page), would reflect the change you had
>> made, whereas after restart, when the page is read again to apply
>> redo, the checksum calculation won't see the change in the missing
>> 2/3 of the page.
>
>
> Correct. But any torn page write must be covered by the restoration of a
> full page image during replay, mustn't it? And that restoration should
> happen blindly, without first reading in the old page and verifying the
> checksum.
>

Probably, but I think that is not currently the way it is handled, and
I don't want to change it. AFAIU, what happens now is that we first
read the old page (performing page verification during the read and
emitting a *warning* if the checksum doesn't match) and then restore
the image. The relevant code path is
XLogReadBufferForRedo()->XLogReadBufferExtended()->ReadBufferWithoutRelcache()->ReadBuffer_common()->PageIsVerified().

Now, the point here is: why are we seeing FATAL instead of WARNING for
the bitmap page of the hash index? The reason, as explained in my
previous e-mail, is that for the bitmap page we are not logging a full
page image, and we should fix that as explained there. Once the full
page image is logged, then the first time replay reads the torn page
it will use the flag RBM_ZERO_AND_LOCK, which will turn the FATAL
error you are seeing into a WARNING.

I think this is the reason why Ashutosh is seeing a WARNING for a heap
page whereas you are seeing a FATAL for the bitmap page of the hash
index.

I don't think we should try to avoid the WARNING for torn pages as
part of this patch, even if that is the better course of action, but
the FATAL should certainly be fixed so that a WARNING is displayed
instead, as it is for heap pages.

>
>> Today, while thinking about this problem, I realized that currently
>> in the patch we are using REGBUF_NO_IMAGE for the bitmap page because
>> of one of the problems reported by you [1]. That change will fix the
>> problem you reported, but it will expose bitmap pages to torn-page
>> hazards. I think the right fix there is to make pd_lower equal to
>> pd_upper for the bitmap page, so that full page writes don't exclude
>> the data in the bitmap page.
>
>
> I'm afraid that is over my head. I can study it until it makes sense, but
> it will take me a while.
>

Does the explanation above help? If not, tell me and I will try to
explain it once more.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
