Re: Write Ahead Logging for Hash Indexes

From:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc:	Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Write Ahead Logging for Hash Indexes
Date:	2016-09-21 05:27:29
Message-ID:	CAA4eK1LmQZGnYhSHXDDCOsSb_0U-gsxReEmSDRgCZr=AdKbTEg@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
>>
>>
>> Okay, Thanks for pointing out the same. I have fixed it. Apart from
>> that, I have changed _hash_alloc_buckets() to initialize the page
>> instead of making it completely Zero because of problems discussed in
>> another related thread [1]. I have also updated README.
>>
>
> With v7 of the concurrent hash patch and v4 of the write ahead log patch and
> the latest relcache patch (I don't know how important that is to reproducing
> this, I suspect it is not), I once got this error:
>
>
> 38422 00000 2016-09-19 16:25:50.055 PDT:LOG: database system was
> interrupted; last known up at 2016-09-19 16:25:49 PDT
> 38422 00000 2016-09-19 16:25:50.057 PDT:LOG: database system was not
> properly shut down; automatic recovery in progress
> 38422 00000 2016-09-19 16:25:50.057 PDT:LOG: redo starts at 3F/2200DE90
> 38422 01000 2016-09-19 16:25:50.061 PDT:WARNING: page verification failed,
> calculated checksum 65067 but expected 21260
> 38422 01000 2016-09-19 16:25:50.061 PDT:CONTEXT: xlog redo at 3F/22053B50
> for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
> 38422 XX001 2016-09-19 16:25:50.071 PDT:FATAL: invalid page in block 9 of
> relation base/16384/17334
> 38422 XX001 2016-09-19 16:25:50.071 PDT:CONTEXT: xlog redo at 3F/22053B50
> for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>
>
> The original page with the invalid checksum is:
>

I think this is an example of the torn page problem, which seems to be
happening because of the code below in your test.

! if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress()) {
!     nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
!     ereport(FATAL,
!             (errcode(ERRCODE_DISK_FULL),
!              errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
!                     blocknum,
!                     relpath(reln->smgr_rnode, forknum),
!                     nbytes, BLCKSZ),
!              errhint("JJ is screwing with the database.")));
! } else {
!     nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
! }

If you are running the above test with JJ_torn_page disabled, then it
is a different matter and we need to investigate it, but I assume you
are running with it enabled.

I think this could happen if the actual change to the page lies in the
last 2/3 of the page, which the above code does not write. The checksum
in the page header, which is written as part of the partial page write
(the first 1/3 of the page), was computed over the page including that
change, whereas after restart, when the page is read again to apply
redo, the checksum is calculated over contents that do not include the
change made in the last 2/3.
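
To make the mechanism concrete, here is a minimal standalone sketch. It
is not PostgreSQL code: the checksum is a toy weighted sum rather than
the real page checksum, and TOY_BLCKSZ, CHECKSUM_OFFSET, and all
function names are assumptions made up for illustration. It writes only
the first third of a modified page over an older on-disk copy and then
verifies the result.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 8192        /* stands in for BLCKSZ */
#define CHECKSUM_OFFSET 8      /* hypothetical 2-byte checksum slot in the header */

/* Toy checksum (NOT PostgreSQL's algorithm): position-weighted byte sum
 * over the whole page, skipping the checksum slot itself. */
static uint16_t toy_checksum(const uint8_t *page)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < TOY_BLCKSZ; i++) {
        if (i == CHECKSUM_OFFSET || i == CHECKSUM_OFFSET + 1)
            continue;
        sum += (uint64_t) page[i] * (i + 1);
    }
    return (uint16_t) (sum % 65521);
}

/* Store the checksum of the current contents in the header slot. */
static void stamp_checksum(uint8_t *page)
{
    uint16_t c = toy_checksum(page);
    memcpy(page + CHECKSUM_OFFSET, &c, sizeof c);
}

/* Returns 1 if the stored checksum matches the page contents. */
static int checksum_ok(const uint8_t *page)
{
    uint16_t stored;
    memcpy(&stored, page + CHECKSUM_OFFSET, sizeof stored);
    return stored == toy_checksum(page);
}

/* Overwrite only the first nbytes of the on-disk copy, like the
 * BLCKSZ/3 write in the test code. */
static void partial_write(uint8_t *disk, const uint8_t *buffer, size_t nbytes)
{
    assert(nbytes <= TOY_BLCKSZ);
    memcpy(disk, buffer, nbytes);
}

/* Returns 1 if the on-disk page still verifies after a torn write of a
 * page whose only change is one byte at change_offset, 0 otherwise. */
static int torn_write_scenario(size_t change_offset)
{
    static uint8_t disk[TOY_BLCKSZ];
    static uint8_t buffer[TOY_BLCKSZ];

    memset(disk, 0, TOY_BLCKSZ);
    stamp_checksum(disk);             /* old version is consistent on disk */

    memcpy(buffer, disk, TOY_BLCKSZ);
    buffer[change_offset] = 0xAB;     /* modify the in-memory copy */
    stamp_checksum(buffer);           /* new checksum covers the change */

    partial_write(disk, buffer, TOY_BLCKSZ / 3);
    return checksum_ok(disk);
}
```

In this model, torn_write_scenario(TOY_BLCKSZ - 100) fails verification
because the new header checksum reached disk but the changed byte did
not, while torn_write_scenario(100) still verifies because the change
falls inside the part that was written.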

Today, Ashutosh shared the logs of his test run, which show a similar
problem for a HEAP page. I think this could happen, though rarely, for
any page with the above kind of test.

Does this explanation account for the problem you are seeing?

>
> If I ignore the checksum failure and re-start the system, the page gets
> restored to be a bitmap page.
>

Okay, but have you ensured in some way that redo is applied to the bitmap page?

Today, while thinking about this problem, I realized that the patch
currently uses REGBUF_NO_IMAGE for the bitmap page to address one of
the problems you reported [1]. That change will fix the problem you
reported, but it will expose bitmap pages to torn-page hazards. I think
the right fix there is to make pd_lower equal to pd_upper for the
bitmap page, so that full-page writes don't exclude the data in the
bitmap page.
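
The pd_lower/pd_upper point can be sketched with a toy model. This is
not the WAL code: it only assumes that a full-page image may omit the
bytes between pd_lower and pd_upper and that restore zero-fills them.
ToyBounds, take_image, restore_image, and data_survives are
hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 8192        /* stands in for BLCKSZ */

/* Hypothetical stand-in for the two page-header fields discussed above. */
typedef struct {
    uint16_t lower;            /* plays the role of pd_lower */
    uint16_t upper;            /* plays the role of pd_upper */
} ToyBounds;

/* Build a backup image that skips the bytes in [lower, upper).
 * Returns the image length. */
static size_t take_image(const uint8_t *page, ToyBounds b, uint8_t *image)
{
    assert(b.lower <= b.upper && b.upper <= TOY_BLCKSZ);
    memcpy(image, page, b.lower);
    memcpy(image + b.lower, page + b.upper, TOY_BLCKSZ - b.upper);
    return (size_t) b.lower + (TOY_BLCKSZ - b.upper);
}

/* Rebuild a page from the image, zero-filling the skipped range. */
static void restore_image(const uint8_t *image, ToyBounds b, uint8_t *page)
{
    memcpy(page, image, b.lower);
    memset(page + b.lower, 0, b.upper - b.lower);
    memcpy(page + b.upper, image + b.lower, TOY_BLCKSZ - b.upper);
}

/* Returns 1 if a byte stored at data_offset survives an image/restore
 * round trip with the given bounds, 0 if it is lost. */
static int data_survives(ToyBounds b, size_t data_offset)
{
    static uint8_t page[TOY_BLCKSZ];
    static uint8_t image[TOY_BLCKSZ];
    static uint8_t restored[TOY_BLCKSZ];

    memset(page, 0, sizeof page);
    page[data_offset] = 0x5A;          /* pretend this is a bitmap word */
    take_image(page, b, image);
    restore_image(image, b, restored);
    return restored[data_offset] == 0x5A;
}
```

With bounds such as {24, 8176}, a byte at offset 4000 sits inside the
skipped range and is lost on restore; with lower equal to upper
(for example {8176, 8176}) nothing is skipped and the byte survives.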

Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1KJOfVvFUmi6dcX9Y2-0PFHkomDzGuyoC%3DaD3Qj9WPpFA%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
