Re: Hash Indexes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Indexes
Date: 2016-11-09 15:40:02
Message-ID: CA+TgmoYLcGy0rXBTrB965Jogp=k4ENYrUv8wkGg9u849bnpq6g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> + * This function expects that the caller has acquired a cleanup lock on the
> + * primary bucket page, and will with a write lock again held on the primary
> + * bucket page. The lock won't necessarily be held continuously, though,
> + * because we'll release it when visiting overflow pages.
>
> Looks like a typo in the above comment: s/will with a write lock/will
> return with a write lock/

Oh, yes. Thanks.

>> + * During scan of overflow pages, first we need to lock the next bucket and
>> + * then release the lock on current bucket. This ensures that any concurrent
>> + * scan started after we start cleaning the bucket will always be behind the
>> + * cleanup. Allowing scans to cross vacuum will allow it to remove tuples
>> + * required for sanctity of scan.
>>
>> This comment says that it's bad if other scans can pass our cleanup
>> scan, but it doesn't explain why. I think it's because we don't have
>> page-at-a-time mode yet,
>>
>
> Right.
>
>> and cleanup might decrease the TIDs for
>> existing index entries.
>>
>
> I think the reason is that cleanup might move tuples around, and in
> doing so it might move a previously returned TID to a position earlier
> than its current one. This is a problem because the scan restarts from
> the previously returned offset and tries to find the previously
> returned tuple's TID. This is mentioned in the README as below:
>
> + It is must to
> +keep scans behind cleanup, else vacuum could remove tuples that are required
> +to complete the scan as the scan that returns multiple tuples from the same
> +bucket page always restart the scan from the previous offset number from which
> +it has returned last tuple.
>
> We might want to slightly improve the README so that the reason is
> clearer, and then have the code comments refer to the README. But I am
> open either way; let me know which way you prefer.

I think we can give a brief explanation right in the code comment. I
referred to "decreasing the TIDs"; you are referring to "moving tuples
around". But I think that moving the tuples to different locations is
not the problem. I think the problem is that a tuple might be
assigned a lower spot in the item pointer array - i.e. the TID
decreases.
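
To spell that out with a hypothetical example (offsets invented for
illustration, not taken from the patch):

    /*
     * 1. A scan returns the tuple at offset 5 and remembers offset 5.
     * 2. Vacuum removes dead tuples and compacts the page, so that
     *    tuple now sits at offset 3 - its TID has decreased.
     * 3. The scan resumes, looks for the remembered tuple starting
     *    from offset 5, fails to find it, and thus cannot re-establish
     *    its position in the bucket.
     */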

>> OK, a couple things here. First, it seems like we could also delete
>> any tuples where ItemIdIsDead, and that seems worth doing.
>
> I think we can't do that, because here we want to rely strictly on the
> vacuum callback function, as btree does. The reason is explained in a
> comment in the function btvacuumpage().

OK, I see. It would probably be good to comment this, then, so that
someone later doesn't get confused as I did.
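
For anyone following along, the pattern in btvacuumpage() looks roughly
like this (abridged from the btree code; the point is that only the
callback decides removability):

    itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
    htup = &(itup->t_tid);

    /* Only vacuum's callback decides whether the tuple is removable. */
    if (callback(htup, callback_state))
        deletable[ndeletable++] = offnum;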

> This looks okay to me. So if you agree with my reasoning for not
> including the first part, then I can take that out and keep this part
> in the next patch.

Cool.

>> I think that might be
>> clearer. When LH_BEING_POPULATED is set, the bucket is being filled -
>> that is, populated - from the old bucket.
>
> How about LH_BUCKET_BEING_POPULATED, or maybe LH_BP_BEING_SPLIT, where
> BP indicates bucket page?

LH_BUCKET_BEING_POPULATED seems good to me.

>> And maybe
>> LH_BUCKET_PAGE_HAS_GARBAGE -> LH_NEEDS_SPLIT_CLEANUP, too.
>>
>
> How about LH_BUCKET_NEEDS_SPLIT_CLEANUP or LH_BP_NEEDS_SPLIT_CLEANUP?
> I am slightly inclined to keep the word Bucket, but let me know if you
> think it makes the name too long.

LH_BUCKET_NEEDS_SPLIT_CLEANUP seems good to me.
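
So, roughly, something like this (bit positions here are placeholders;
they would need to fit whatever hash.h already defines):

    /* hypothetical sketch; actual values depend on the existing flags */
    #define LH_BUCKET_BEING_POPULATED     (1 << 4)
    #define LH_BUCKET_NEEDS_SPLIT_CLEANUP (1 << 5)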

>> How? Can we just use an
>> if-then instead of a for-loop?
>
> I can see the two possibilities below:
>
> First way -
>
> retry:
>     mask = lowmask + 1;
>     new_bucket = old_bucket | mask;
>     if (new_bucket > maxbucket)
>     {
>         lowmask = lowmask >> 1;
>         goto retry;
>     }
>
> Second way -
>
> new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);
> if (new_bucket > maxbucket)
> {
>     lowmask = lowmask >> 1;
>     new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);
> }
>
> #define CALC_NEW_BUCKET(old_bucket, lowmask) \
>     ((old_bucket) | ((lowmask) + 1))
>
> Do you have something else in mind?

Second one would be my preference.

>> I still don't like the names of these functions very much. If you
>> said "get X from Y", it would be clear that you put in Y and you get
>> out X. If you say "X 2 Y", it would be clear that you put in X and
>> you get out Y. As it is, it's not very clear which is the input and
>> which is the output.
>
> In these names, whatever comes earlier is the input and the later one
> is the output; for example, the existing function
> _hash_get_indextuple_hashkey(). However, feel free to suggest better
> names here. How about _hash_get_oldbucket2newblock() or
> _hash_get_newblock_from_oldbucket() or simply _hash_get_newblock()?

The problem with _hash_get_newblock() is that it sounds like you are
getting a new block in the relation, not the new bucket (or
corresponding block) for some old bucket. The name isn't specific
enough to know what "new" means.

In general, I think "new" and "old" are not very good terminology
here. It's not entirely intuitive what they mean, and as soon as it
becomes unclear that you are speaking of something happening *in the
context of a bucket split* then it becomes much less clear. I don't
really have any ideas here that are altogether good; either of your
other two suggestions (not _hash_get_newblock()) seems OK.
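
For concreteness, a guess at how the signature might look under the
_hash_get_newblock_from_oldbucket spelling (argument list is my
assumption, not taken from the patch):

    extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel,
                                                         Bucket old_bucket);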

>> + /*
>> + * Acquiring cleanup lock to clear the split-in-progress flag ensures that
>> + * there is no pending scan that has seen the flag after it is cleared.
>> + */
>> + _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
>> + opage = BufferGetPage(bucket_obuf);
>> + oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
>>
>> I don't understand the comment, because the code *isn't* acquiring a
>> cleanup lock.
>
> Oops, this is a remnant from one of the design approaches for clearing
> these flags, which was later discarded due to issues. I will change
> this to indicate an exclusive lock.

Of course, an exclusive lock doesn't guarantee anything like that...
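
For contrast, in bufmgr terms (a minimal sketch, not what the patch
does):

    /* A cleanup lock waits until we hold the only pin, so no scan can
     * still be paused partway through this page. */
    LockBufferForCleanup(bucket_obuf);

    /* An exclusive lock only excludes other lockers; a scan holding
     * just a pin may resume later and still see the flag. */
    LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE);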

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
