Re: Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Indexes
Date: 2016-11-12 05:49:38
Message-ID: CAA4eK1+pnr8jbpf=u64tSboVbhA0cJrzzcoTsF+Eh5_R64SQcg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 9, 2016 at 1:23 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> [ new patches ]
>
> Attached is yet another incremental patch with some suggested changes.
>
> + * This expects that the caller has acquired a cleanup lock on the target
> + * bucket (primary page of a bucket) and it is reponsibility of caller to
> + * release that lock.
>
> This is confusing, because it makes it sound like we retain the lock
> through the entire execution of the function, which isn't always true.
> I would say that caller must acquire a cleanup lock on the target
> primary bucket page before calling this function, and that on return
> that page will again be write-locked. However, the lock might be
> temporarily released in the meantime, while visiting overflow pages.
> (Attached patch has a suggested rewrite.)
>
> + * During scan of overflow pages, first we need to lock the next bucket and
> + * then release the lock on current bucket. This ensures that any concurrent
> + * scan started after we start cleaning the bucket will always be behind the
> + * cleanup. Allowing scans to cross vacuum will allow it to remove tuples
> + * required for sanctity of scan.
>
> This comment says that it's bad if other scans can pass our cleanup
> scan, but it doesn't explain why. I think it's because we don't have
> page-at-a-time mode yet, and cleanup might decrease the TIDs for
> existing index entries. (Attached patch has a suggested rewrite, but
> might need further adjustment if my understanding of the reasons is
> incomplete.)
>

Okay, I have included your changes with a minor typo fix, and updated
the README to use similar language.
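
For reference, the reworded header comment (based on your suggested
rewrite) now reads along these lines:

    /*
     * The caller must acquire a cleanup lock on the target primary
     * bucket page before calling this function; on return, that page
     * will again be write-locked.  However, the lock may be temporarily
     * released while visiting overflow pages.
     */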

> + if (delay)
> + vacuum_delay_point();
>
> You don't really need "delay". If we're not in a cost-accounted
> VACUUM, vacuum_delay_point() just turns into CHECK_FOR_INTERRUPTS(),
> which should be safe (and a good idea) regardless. (Fixed in
> attached.)
>

The new patch contains this fix.
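
To illustrate (a minimal sketch, not the exact patch): the guard is
simply dropped, since outside a cost-accounted VACUUM
vacuum_delay_point() reduces to CHECK_FOR_INTERRUPTS() and is safe to
call unconditionally.

    /* once per iteration of the bucket-scan loop */
    vacuum_delay_point();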

> + if (callback && callback(htup, callback_state))
> + {
> + /* mark the item for deletion */
> + deletable[ndeletable++] = offno;
> + if (tuples_removed)
> + *tuples_removed += 1;
> + }
> + else if (bucket_has_garbage)
> + {
> + /* delete the tuples that are moved by split. */
> + bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
> + maxbucket,
> + highmask,
> + lowmask);
> + /* mark the item for deletion */
> + if (bucket != cur_bucket)
> + {
> + /*
> + * We expect tuples to either belong to curent bucket or
> + * new_bucket. This is ensured because we don't allow
> + * further splits from bucket that contains garbage. See
> + * comments in _hash_expandtable.
> + */
> + Assert(bucket == new_bucket);
> + deletable[ndeletable++] = offno;
> + }
> + else if (num_index_tuples)
> + *num_index_tuples += 1;
> + }
> + else if (num_index_tuples)
> + *num_index_tuples += 1;
> + }
>
> OK, a couple things here. First, it seems like we could also delete
> any tuples where ItemIdIsDead, and that seems worth doing. In fact, I
> think we should check it prior to invoking the callback, because it's
> probably quite substantially cheaper than the callback. Second,
> repeating deletable[ndeletable++] = offno and *num_index_tuples += 1
> doesn't seem very clean to me; I think we should introduce a new bool
> to track whether we're keeping the tuple or killing it, and then use
> that to drive which of those things we do. (Fixed in attached.)
>

As discussed upthread, I have included your changes, apart from the
one related to ItemIdIsDead.
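
For reference, a rough sketch of the restructured loop body as per
your suggestion; the leading ItemIdIsDead() check is the part left out
for now, and the variable names mirror the existing code:

    ItemId      itemid = PageGetItemId(page, offno);
    bool        kill_tuple = false;

    if (ItemIdIsDead(itemid))
        kill_tuple = true;      /* cheap check, done before the callback */
    else if (callback && callback(htup, callback_state))
    {
        kill_tuple = true;
        if (tuples_removed)
            *tuples_removed += 1;
    }
    else if (bucket_has_garbage)
    {
        /* tuples moved by a split belong to new_bucket; delete them */
        bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
                                      maxbucket, highmask, lowmask);
        if (bucket != cur_bucket)
        {
            Assert(bucket == new_bucket);
            kill_tuple = true;
        }
    }

    if (kill_tuple)
        deletable[ndeletable++] = offno;
    else if (num_index_tuples)
        *num_index_tuples += 1;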

> + if (H_HAS_GARBAGE(bucket_opaque) &&
> + !H_INCOMPLETE_SPLIT(bucket_opaque))
>
> This is the only place in the entire patch that uses
> H_INCOMPLETE_SPLIT(), and I'm wondering if that's really correct even
> here. Don't you really want H_OLD_INCOMPLETE_SPLIT() here? (And
> couldn't we then remove H_INCOMPLETE_SPLIT() itself?) There's no
> garbage to be removed from the "new" bucket until the next split, when
> it will take on the role of the "old" bucket.
>

Fixed.

> I think it would be a good idea to change this so that
> LH_BUCKET_PAGE_HAS_GARBAGE doesn't get set until
> LH_BUCKET_OLD_PAGE_SPLIT is cleared. The current way is confusing,
> because those tuples are NOT garbage until the split is completed!
> Moreover, both of the places that care about
> LH_BUCKET_PAGE_HAS_GARBAGE need to make sure that
> LH_BUCKET_OLD_PAGE_SPLIT is clear before they do anything about
> LH_BUCKET_PAGE_HAS_GARBAGE, so the change I am proposing would
> actually simplify the code very slightly.
>

Yeah, I have changed it as per the above suggestion. However, I think
with this change we need to check only the garbage flag during vacuum.
For now, I am checking both the incomplete-split and garbage flags in
vacuum just to be extra sure, but if you also feel that the
incomplete-split check can be removed, then I will do so.
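
To illustrate what the vacuum-side test could reduce to (a sketch;
split_cleanup is an illustrative local flag, and as noted the patch
currently still tests both flags):

    /*
     * The garbage flag is never set while a split is still in progress
     * on this bucket, so checking it alone is sufficient.
     */
    if (H_HAS_GARBAGE(bucket_opaque))
        split_cleanup = true;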

> +#define H_OLD_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
> +#define H_NEW_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
>
> The code isn't consistent about the use of these macros, or at least
> not in a good way. When you care about LH_BUCKET_OLD_PAGE_SPLIT, you
> test it using the macro; when you care about H_NEW_INCOMPLETE_SPLIT,
> you ignore the macro and test it directly. Either get rid of both
> macros and always test directly, or keep both macros and use both of
> them. Using a macro for one but not the other is strange.
>

Used macro for both.

> I wonder if we should rename these flags and macros. Maybe
> LH_BUCKET_OLD_PAGE_SPLIT -> LH_BEING_SPLIT and
> LH_BUCKET_NEW_PAGE_SPLIT -> LH_BEING_POPULATED. I think that might be
> clearer. When LH_BEING_POPULATED is set, the bucket is being filled -
> that is, populated - from the old bucket. And maybe
> LH_BUCKET_PAGE_HAS_GARBAGE -> LH_NEEDS_SPLIT_CLEANUP, too.
>

Changed the names as per discussion up thread.
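
For illustration, with the renaming the flags and macros look roughly
like this (the exact names and bit positions are per the upthread
discussion; this is only a sketch):

    #define LH_BUCKET_BEING_POPULATED     (1 << 4)
    #define LH_BUCKET_BEING_SPLIT         (1 << 5)
    #define LH_BUCKET_NEEDS_SPLIT_CLEANUP (1 << 6)

    #define H_NEEDS_SPLIT_CLEANUP(opaque) \
        ((opaque)->hasho_flag & LH_BUCKET_NEEDS_SPLIT_CLEANUP)
    #define H_BUCKET_BEING_SPLIT(opaque) \
        ((opaque)->hasho_flag & LH_BUCKET_BEING_SPLIT)
    #define H_BUCKET_BEING_POPULATED(opaque) \
        ((opaque)->hasho_flag & LH_BUCKET_BEING_POPULATED)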

> + * Copy bucket mapping info now; The comment in _hash_expandtable
> + * where we copy this information and calls _hash_splitbucket explains
> + * why this is OK.
>
> After a semicolon, the next word should not be capitalized. There
> shouldn't be two spaces after a semicolon, either. Also,
> _hash_splitbucket appears to have a verb before it (calls) and a verb
> after it (explains) and I have no idea what that means.
>

Fixed.

> + for (;;)
> + {
> + mask = lowmask + 1;
> + new_bucket = old_bucket | mask;
> + if (new_bucket > metap->hashm_maxbucket)
> + {
> + lowmask = lowmask >> 1;
> + continue;
> + }
> + blkno = BUCKET_TO_BLKNO(metap, new_bucket);
> + break;
> + }
>
> I can't help feeling that it should be possible to do this without
> looping. Can we ever loop more than once? How? Can we just use an
> if-then instead of a for-loop?
>
> Can't _hash_get_oldbucket_newblock call _hash_get_oldbucket_newbucket
> instead of duplicating the logic?
>

Changed as per discussion up thread.
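
To make the reasoning concrete: new_bucket can exceed hashm_maxbucket
at most once, because old_bucket's most recent split must have happened
at either the current mask level or the previous one. A sketch of the
loop-free version:

    new_bucket = old_bucket | (lowmask + 1);
    if (new_bucket > metap->hashm_maxbucket)
    {
        /* old_bucket was last split at the previous mask level */
        lowmask = lowmask >> 1;
        new_bucket = old_bucket | (lowmask + 1);
    }
    blkno = BUCKET_TO_BLKNO(metap, new_bucket);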

> I still don't like the names of these functions very much. If you
> said "get X from Y", it would be clear that you put in Y and you get
> out X. If you say "X 2 Y", it would be clear that you put in X and
> you get out Y. As it is, it's not very clear which is the input and
> which is the output.
>

Changed as per discussion up thread.

> + bool primary_buc_page)
>
> I think we could just go with "primary_page" here. (Fixed in attached.)
>

Included the change in the attached version of the patch.

> + /*
> + * Acquiring cleanup lock to clear the split-in-progress flag ensures that
> + * there is no pending scan that has seen the flag after it is cleared.
> + */
> + _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
> + opage = BufferGetPage(bucket_obuf);
> + oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
>
> I don't understand the comment, because the code *isn't* acquiring a
> cleanup lock.
>

Removed this comment.

>> Some more review:
>>
>> The API contract of _hash_finish_split seems a bit unfortunate. The
>> caller is supposed to have obtained a cleanup lock on both the old and
>> new buffers, but the first thing it does is walk the entire new bucket
>> chain, completely ignoring the old one. That means holding a cleanup
>> lock on the old buffer across an unbounded number of I/O operations --
>> which also means that you can't interrupt the query by pressing ^C,
>> because an LWLock (on the old buffer) is held.
>>
>

Fixed in the attached patch, as per the algorithm proposed a few lines
down in this mail.

> I see the problem you are talking about, but it was done to ensure
> locking order, old bucket first and then new bucket, else there could
> be a deadlock risk. However, I think we can avoid holding the cleanup
> lock on old bucket till we scan the new bucket to form a hash table of
> TIDs.
>
>> Moreover, the
>> requirement to hold a lock on the new buffer isn't convenient for
>> either caller; they both have to go do it, so why not move it into the
>> function?
>>
>
> Yeah, we can move the locking of new bucket entirely into new function.
>

Done.

>> Perhaps the function should be changed so that it
>> guarantees that a pin is held on the primary page of the existing
>> bucket, but no locks are held.
>>
>
> Okay, so we can change the locking order as follows:
> a. ensure a cleanup lock on the old bucket and check if the (old)
> bucket has a pending split.
> b. if there is a pending split, release the lock on the old bucket,
> but not the pin.
>
> The steps below will be performed by _hash_finish_split():
>
> c. acquire the read content lock on the new bucket and form the hash
> table of TIDs; in the process of forming the hash table, we need to
> traverse the whole bucket chain. While traversing the bucket chain,
> release the lock on the previous page (both lock and pin, if it is
> not the primary bucket page).
> d. after the hash table is formed, acquire cleanup locks on the old
> and new buckets conditionally; if we are unable to get a cleanup lock
> on either, just return from there.
> e. perform the split operation.
> f. release the lock and pin on the new bucket.
> g. release the lock on the old bucket. We don't want to release the
> pin on the old bucket, as the caller has ensured it before passing
> the bucket to _hash_finish_split(), so releasing the pin should be
> the responsibility of the caller.
>
> Now, both callers need to ensure that they restart the operation from
> the beginning, because after we release the lock on the old bucket,
> somebody might have split the bucket.
>
> Does the above change in locking strategy sound okay?
>

I have changed the locking strategy as described above, and
accordingly changed the prototype of _hash_finish_split.
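
To illustrate the resulting flow inside _hash_finish_split (a rough
sketch, not the exact patch; on entry, the caller holds only a pin on
the old bucket's primary page):

    /* (c) read-lock the new bucket and build a hash table of the TIDs
     * it contains, walking the whole bucket chain; the lock (and pin,
     * except on the primary page) of each page is dropped as we go */
    nbuf = _hash_getbuf(rel, nblkno, HASH_READ, LH_BUCKET_PAGE);
    /* ... scan the chain, fill the hash table ... */
    LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);

    /* (d) take cleanup locks conditionally, old bucket first to keep
     * the locking order; give up if either lock is unavailable, and
     * let the caller restart from the beginning */
    if (!ConditionalLockBufferForCleanup(obuf))
        return;
    if (!ConditionalLockBufferForCleanup(nbuf))
    {
        LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
        return;
    }

    /* (e) perform the split, using the hash table to skip tuples that
     * are already present in the new bucket */

    /* (f) release the lock and pin on the new bucket; (g) release only
     * the lock on the old bucket -- its pin belongs to the caller */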

>> Where _hash_finish_split does LockBufferForCleanup(bucket_nbuf),
>> should it instead be trying to get the lock conditionally and
>> returning immediately if it doesn't get the lock? Seems like a good
>> idea...
>>
>
> Yeah, we can take a cleanup lock conditionally, but it would waste
> the effort of forming the hash table if we don't get the cleanup lock
> immediately. Considering incomplete splits to be a rare operation,
> maybe the wasted effort is okay, but I am not sure. Don't you think
> we should avoid that effort?
>

Changed it to conditional lock.

>> * We're at the end of the old bucket chain, so we're done partitioning
>> * the tuples. Mark the old and new buckets to indicate split is
>> * finished.
>> *
>> * To avoid deadlocks due to locking order of buckets, first lock the old
>> * bucket and then the new bucket.
>>
>> These comments have drifted too far from the code to which they refer.
>> The first part is basically making the same point as the
>> slightly-later comment /* indicate that split is finished */.
>>
>
> I think we can remove the second comment /* indicate that split is
> finished */.

Removed this comment.

> itupsize = new_itup->t_info & INDEX_SIZE_MASK;
> new_itup->t_info &= ~INDEX_SIZE_MASK;
> new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
> new_itup->t_info |= itupsize;
>
> If I'm not mistaken, you could omit the first, second, and fourth
> lines here and keep only the third one, and it would do exactly the
> same thing. The first line saves the bits in INDEX_SIZE_MASK. The
> second line clears the bits in INDEX_SIZE_MASK. The fourth line
> re-sets the bits that were originally saved.
>

You are right and I have changed the code as per your suggestion.
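
So the whole sequence collapses to one line, since OR-ing in the flag
bit leaves the INDEX_SIZE_MASK bits untouched:

    new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;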

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment: concurrent_hash_index_v11.patch (application/octet-stream, 101.3 KB)
