Re: Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Indexes
Date: 2016-11-08 02:51:17
Message-ID: CAA4eK1+-2_0QDaqD_FMFfNu=aweCXKuR0gSd7UPjUObU7VuWTQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 3, 2016 at 3:55 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Wed, Nov 2, 2016 at 6:39 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>> [ new patches ]
>>
>> I looked over parts of this today, mostly the hashinsert.c changes.
>>
>
>> At the point this code is
>> running, we have a pin but no lock on the metapage, so this is only
>> safe if changing any of these fields requires a cleanup lock on the
>> metapage. If that's true,
>>
>
> No, that's not true: we need just an exclusive content lock to update
> those fields, and these fields should be copied while we hold a share
> content lock on the metapage. In version-8 of the patch this was
> correct, but in the last version it seems I moved it during code
> re-arrangement. I will change it so that these values are copied under
> a metapage share content lock.
>

Fixed as mentioned.

>
>
>> + nblkno = _hash_get_newblk(rel, pageopaque);
>>
>> I think this is not a great name for this function. It's not clear
>> what "new blocks" refers to, exactly. I suggest
>> FIND_SPLIT_BUCKET(metap, bucket) or OLD_BUCKET_TO_NEW_BUCKET(metap,
>> bucket) returning a new bucket number. I think that macro can be
>> defined as something like this: bucket + (1 <<
>> (fls(metap->hashm_maxbucket) - 1)).
>>
>
> I think such a macro would not work for the incomplete-split case.
> The reason is that by the time we try to complete the split of the
> current old bucket, the table half (lowmask, highmask, maxbucket)
> could have changed, and the macro could return a bucket in the new
> table half.
>

I have changed the function name to _hash_get_oldbucket_newblock() and
passed the Bucket as a second parameter.

>
>>
>> Moving on ...
>>
>> /*
>> * ovfl page exists; go get it. if it doesn't have room, we'll
>> - * find out next pass through the loop test above.
>> + * find out next pass through the loop test above. Retain the
>> + * pin, if it is a primary bucket page.
>> */
>> - _hash_relbuf(rel, buf);
>> + if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
>> + _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
>> + else
>> + _hash_relbuf(rel, buf);
>>
>> It seems like it would be cheaper, safer, and clearer to test whether
>> buf != bucket_buf here, rather than examining the page opaque data.
>> That's what you do down at the bottom of the function when you ensure
>> that the pin on the primary bucket page gets released, and it seems
>> like it should work up here, too.
>>
>> + bool retain_pin = false;
>> +
>> + /* page flags must be accessed before releasing lock on a page. */
>> + retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;
>>
>> Similarly.
>>
>
> Agreed, will change the usage as per your suggestion.
>

Changed as discussed. I have also changed similar usage at a few other
places in the patch.

>> I have also attached a patch with some suggested comment changes.
>>
>
> I will include it in next version of patch.
>

Included in new version of patch.

>> Some more review...
>>
>> @@ -656,6 +678,10 @@ _hash_squeezebucket(Relation rel,
>> IndexTuple itup;
>> Size itemsz;
>>
>> + /* skip dead tuples */
>> + if (ItemIdIsDead(PageGetItemId(rpage, roffnum)))
>> + continue;
>>
>> Is this an optimization independent of the rest of the patch, or is
>> there something in this patch that necessitates it?
>>
>
> This specific case is independent of the rest of the patch, but the
> same optimization is used in _hash_splitbucket_guts(), where it is
> mandatory: otherwise the split would make a copy of the tuple without
> copying its dead flag.
>
>> i.e. Could this
>> small change be committed independently?
>
> Both places, _hash_squeezebucket() and _hash_splitbucket(), can use
> this optimization irrespective of the rest of the patch. I will
> prepare a separate patch for these and send it along with the main
> patch after some testing.
>

Done as a separate patch, skip_dead_tups_hash_index-v1.patch.

>> If not, then I think it
>> needs a better comment explaining why it is now mandatory.
>>
>> - * Caller must hold exclusive lock on the target bucket. This allows
>> + * Caller must hold cleanup lock on the target bucket. This allows
>> * us to safely lock multiple pages in the bucket.
>>
>> The notion of a lock on a bucket no longer really exists; with this
>> patch, we'll now properly speak of a lock on a primary bucket page.
>> Also, I think the bit about safely locking multiple pages is bizarre
>> -- that's not the issue at all: the problem is that reorganizing a
>> bucket might confuse concurrent scans into returning wrong answers.
>>
>> I've included a broader updating of that comment, and some other
>> comment changes, in the attached incremental patch, which also
>> refactors your changes to _hash_freeovflpage() a bit to avoid some
>> code duplication. Please consider this for inclusion in your next
>> version.
>>
>
> Your modifications look good to me, so I will include them in the next version.
>

Included in new version of patch.

>> In hashutil.c, I think that _hash_msb() is just a reimplementation of
>> fls(), which you can rely on being present because we have our own
>> implementation in src/port. It's quite similar to yours but slightly
>> shorter. :-) Also, some systems have a builtin fls() function which
>> actually optimizes down to a single machine instruction, and which is
>> therefore much faster than either version.
>>
>
> Agreed, will change as per suggestion.
>

Changed as per suggestion.

>> I don't like the fact that _hash_get_newblk() and _hash_get_oldblk()
>> are working out the bucket number by using the HashOpaque structure
>> within the bucket page they're examining. First, it seems weird to
>> pass the whole structure when you only need the bucket number out of
>> it. More importantly, the caller really ought to know what bucket
>> they care about without having to discover it.
>>
>>
>> So it seems to me that these functions could be simplified to take the
>> bucket number as an argument directly instead of the HashOpaque.
>>
>
> Okay, I agree that it is better to use the bucket number in both
> APIs, so I will change them accordingly.
>

Changed as per suggestion.

>> Generally, this pattern recurs throughout the patch. Every time you
>> use the data in the page to figure something out which the caller
>> already knew, you're introducing a risk of bugs: what if the answers
>> don't match? I think you should try to root out as much of that from
>> this code as you can.
>>
>
> Okay, I will review the patch once from this angle and see if I can improve it.
>

I have reviewed the patch and found multiple places, such as
hashbucketcleanup(), _hash_readnext(), and _hash_readprev(), where this
pattern was used. I have changed all such places so that the caller
passes the information when it already has it.

Thanks to Ashutosh Sharma, who helped ensure that the latest patches
don't introduce any concurrency hazards by testing with pgbench at high
client counts.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
skip_dead_tups_hash_index-v1.patch application/octet-stream 1.0 KB
concurrent_hash_index_v10.patch application/octet-stream 100.0 KB
