Re: pageinspect: Hash index support

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>
Cc: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pageinspect: Hash index support
Date: 2017-01-27 17:00:49
Message-ID: CA+TgmoZUjrVy52TUU3b_Q5L6jcrt7w5R4qFwMXdeCuKQBmYWqQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 24, 2017 at 9:59 PM, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com> wrote:
> On Thu, Jan 19, 2017 at 5:08 PM, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>
> Thanks, Ashutosh and Jesper. I have tested the patch I do not have any
> more comments so making it ready for committer.

I took a look at this patch. I think hash_page_items() is buggy.
It's a little confused about whether it is building an array of datums
(which can be assembled into a tuple using heap_form_tuple) or an
array of cstrings (which can be assembled into a tuple using
BuildTupleFromCStrings). It's mostly the former: the array is of type
Datum, and two of the three values put into it are datums. But the
code that puts the TID in there is stolen from bt_page_items(), which
is constructing an array of cstrings, so it's wrong here. The
practical consequence of this is, I believe, that the TIDs output by
this function will be totally garbled and wrong, which is possibly why
the example in the documentation looks so wacky:

+ 1 | (3407872,12584) | 1053474816
+ 2 | (3407872,12584) | 1053474816

The heap tuple is on page 3407872 at line pointer offset 12584?
That's an awfully but not implausibly big page number (about 26GB into
the relation) and an awfully and implausibly large line pointer offset
(how do we fit 12584 index tuples into an 8192-byte page?). Also, the
fact that this example has two index entries with the same hash code
pointing at the same heap TID seems wrong; wouldn't that result in
index scans returning duplicates? I think what's actually happening
here is that (3407872,12584) is actually the attempt to interpret some
chunk of memory containing the text representation of a TID as an
actual TID. When I print those numbers as hex, I get 0x343128, and
those are the digits "4" and "1" and an opening parenthesis ")" as
ASCII, so that might fit this theory.

The code that tries to extract the hashcode from the item also looks
suspicious. I'm not 100% sure whether it's wrong or just
strangely-written, but it doesn't make much sense to extract the item
contents, then using sprintf() to turn that into a hex string of
bytes, then parse the hex string using strtol() to get an integer
back. I think what it ought to be doing is getting a pointer to the
index tuple and then calling _hash_get_indextuple_hashkey.

Another minor complaint about hash_page_items is that it doesn't
pfree(uargs->page). I'm not sure it's necessary to pfree anything
here, but if we're going to do it I think we might as well be
consistent with the btree case. Anyway it can hardly make sense to
free the 8-byte structure and not the 8192-byte page to which it's
pointing.

In hash_bimap_info(), we go to the trouble of creating a raw page but
never do anything with it. I guess the idea here is just to error out
if the supplied page number is not an overflow page, but there are no
comments explaining that. Anyway, I'm not sure it's a very good idea,
because it means that if you try to write a query to interrogate the
status of all the bitmap pages, it's going to read all of the overflow
pages to which they point, which makes the operation a LOT more
expensive. I think it would be better to leave this part out; if the
user wants to know which pages are actually overflow pages, they can
use hash_page_type() to figure it out. Let's make it the job of this
function just to check the available/free status of the page in the
bitmap, without reading the bitmap itself.

In doing that, I think it should either return (bitmapblkno,
bitmapbit, bit) or just bit. Returning bitmapblkno but not bitmapbit
seems strange. Also, I think it should return bit as a bool, not
int4.

Another general note is that, in general, if you use the
BlahGetDatum() function to construct a datum, the SQL type should be
match the macro you picked - e.g. if the SQL type is int4, the macro
used should be Int32GetDatum(), not UInt32GetDatum() or anything else.
If the SQL type is bool, you should be using BoolGetDatum().

Apart from the above, I did a little work to improve the reporting
when a page of the wrong type is verified. Updated patch with those
changes attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
pageinspect-hashindex-rmh.patch application/octet-stream 26.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-01-27 17:11:38 Re: pg_ls_dir & friends still have a hard-coded superuser check
Previous Message Stephen Frost 2017-01-27 16:56:06 Re: potential hardware donation