Re: Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Indexes
Date: 2016-12-11 06:24:54
Message-ID: CAA4eK1Kf6tOY0oVz_SEdngiNFkeXrA3xUSDPPORQvsWVPdKqnA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Tue, Dec 6, 2016 at 4:00 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>> On Tue, Dec 6, 2016 at 1:29 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> >
>> >
>> > I just occasionally insert a bunch of equal tuples, which have to be in
>> > overflow pages no matter how much splitting happens.
>> >
>> > I am getting vacuum errors against HEAD, after about 20 minutes or so (8
>> > cores).
>> >
>> > 49233 XX002 2016-12-05 23:06:44.087 PST:ERROR: index "foo_index_idx"
>> > contains unexpected zero page at block 64941
>> > 49233 XX002 2016-12-05 23:06:44.087 PST:HINT: Please REINDEX it.
>> > 49233 XX002 2016-12-05 23:06:44.087 PST:CONTEXT: automatic vacuum of
>> > table
>> > "jjanes.public.foo"
>> >
>>
>> Thanks for the report. This can happen due to vacuum trying to access
>> a new page. Vacuum can do so if (a) the calculation for maxbuckets
>> (in the metapage) is wrong or (b) it is trying to free the overflow
>> page twice. Offhand, I don't see how that can happen in the code. I
>> will investigate further to see if there is any other reason why
>> vacuum can access the new page. BTW, have you done the test after
>> commit 2f4193c3? That doesn't appear to be the cause of this problem,
>> but still, it is better to test after that fix. I am trying to
>> reproduce the issue, but in the meantime, if by any chance you have a
>> call stack, please share it.
>
>
> It looks like I compiled the code for testing a few minutes before 2f4193c3
> went in.
>
> I've run it again with cb9dcbc1eebd8, after promoting the ERROR in
> _hash_checkpage, hashutil.c:174 to a PANIC so that I can get a coredump to
> feed to gdb.
>
> This time it was an INSERT, not autovac, that got the error:
>

The reason for this, and for the similar error in vacuum, was that in
one of the corner cases, after freeing the overflow page and updating
the link in the previous bucket page, we were not marking the buffer as
dirty. So, due to concurrent activity, the buffer containing the
updated link got evicted, and when we later accessed the same buffer,
it was read back as the old copy, which still contained a link to the
freed overflow page.
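
To illustrate the class of bug (this is a rough sketch, not the actual
patch; the function and variable names here are hypothetical):

/*
 * Hypothetical sketch: unlink a freed overflow page from its
 * predecessor.  The bug described above is of this class: the
 * MarkBufferDirty() call after the in-memory change was missing, so
 * the shared buffer could be evicted and later read back with the
 * stale link still pointing at the freed page.  The caller is assumed
 * to hold an exclusive content lock on prevbuf.
 */
static void
unlink_freed_ovflpage(Buffer prevbuf, BlockNumber nextblkno)
{
    Page           prevpage = BufferGetPage(prevbuf);
    HashPageOpaque prevopaque =
        (HashPageOpaque) PageGetSpecialPointer(prevpage);

    prevopaque->hasho_nextblkno = nextblkno;  /* skip the freed page */
    MarkBufferDirty(prevbuf);                 /* the missing step */
}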

Apart from the above issue, Kuntal noticed an assertion failure
(Assert(bucket == new_bucket);) in hashbucketcleanup with the same test
you provided. The cause of that problem was that after deleting tuples
in hashbucketcleanup, we were not marking the buffer as dirty, so the
old copy of the overflow page reappeared and triggered the failure.
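
The corrected pattern there looks roughly like this (a sketch close to
what hashbucketcleanup does for each page; the variable names are
assumed):

    if (ndeletable > 0)
    {
        /* Remove the dead tuples collected for this page ... */
        PageIndexMultiDelete(page, deletable, ndeletable);
        /* ... and mark the buffer dirty, which was previously missing. */
        MarkBufferDirty(buf);
    }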

After fixing the above problem, we noticed another assertion failure
(Assert(bucket == obucket);) in _hash_splitbucket_guts. The cause of
this problem was that, after a split, vacuum failed to remove the
tuples from the old bucket that had been moved by the split. During the
next split from the same old bucket, we don't expect the old bucket to
still contain tuples from the previous split. To fix this, whenever
vacuum needs to perform split cleanup, it should first update the
metapage values (the masks required to calculate the bucket number) so
that it doesn't miss cleaning any tuples; a sketch of this idea follows
below. I believe this is the same assertion that Andreas reported in
another thread [1].
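
Roughly, the idea is for vacuum to read fresh values from the metapage
just before doing the split cleanup. A hedged sketch (the variable
names are assumed, and the hashbucketcleanup signature is taken from
the recent concurrent hash index work, so treat it as illustrative):

    metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
    metap = HashPageGetMeta(BufferGetPage(metabuf));

    /*
     * Pick up the current bucket-mapping values so that tuples moved
     * by any split that happened after our last look are not missed.
     */
    cur_maxbucket = metap->hashm_maxbucket;
    cur_highmask = metap->hashm_highmask;
    cur_lowmask = metap->hashm_lowmask;
    _hash_relbuf(rel, metabuf);

    hashbucketcleanup(rel, cur_bucket, bucket_buf, blkno, bstrategy,
                      cur_maxbucket, cur_highmask, cur_lowmask,
                      tuples_removed, num_index_tuples,
                      true /* split_cleanup */, callback, callback_state);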

The next problem we encountered was that, after running the same test
for somewhat longer, inserts started failing with the error "unexpected
zero page at block ..". After some analysis, I found that the lock
chain (lock the next overflow bucket page before releasing the previous
bucket page) was broken in one corner case in _hash_freeovflpage, due
to which an insert got ahead of the squeeze-bucket operation and
accessed the freed overflow page before its link had been updated.
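
For clarity, the lock-chaining rule that the squeeze path must preserve
looks like this (a sketch; buf and nextbuf are assumed names):

    /*
     * Lock the next overflow page in the chain *before* releasing the
     * current one, so a concurrent reader can never follow the
     * next-link into a page that has already been freed.
     */
    nextbuf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);

    /* ... finish updating links on the current page ... */

    _hash_relbuf(rel, buf);     /* only now give up the previous page */
    buf = nextbuf;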

With the above fixes, the test ran successfully for more than a day.

Many thanks to Kuntal and Dilip for helping me analyze and test the
fixes for these problems.

[1] - https://www.postgresql.org/message-id/87y3zrzbu5.fsf_-_%40ansel.ydns.eu

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment: fix_hashindex_issues_v1.patch (application/octet-stream, 9.9 KB)
