Re: Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Indexes
Date: 2016-11-23 13:50:10
Message-ID: CAA4eK1LtONpESxHURD6KLLYxB0Pj8WXvxb=-VZg2_hEruW3BUg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 17, 2016 at 3:08 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> You are right and I have changed the code as per your suggestion.
>
> So...
>
> + /*
> + * We always maintain the pin on bucket page for whole scan operation,
> + * so releasing the additional pin we have acquired here.
> + */
> + if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
> + _hash_dropbuf(rel, *bufp);
>
> This relies on the page contents to know whether we took a pin; that
> seems like a bad plan. We need to know intrinsically whether we took
> a pin.
>

Okay, changed to not rely on page contents.
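
To make the intent concrete, the fix is essentially to carry our own flag
(the variable name here is illustrative, not necessarily what the patch
uses) instead of inspecting the special space:

    bool    primary_buf = ...;  /* did we take the extra pin on the primary
                                 * bucket page?  set by the code that
                                 * acquired it */

    /*
     * We always maintain the pin on the bucket page for the whole scan, so
     * release the additional pin acquired here -- decided from our own
     * state, not from what the page's special space says.
     */
    if (primary_buf)
        _hash_dropbuf(rel, *bufp);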

> + * If the bucket split is in progress, then we need to skip tuples that
> + * are moved from old bucket. To ensure that vacuum doesn't clean any
> + * tuples from old or new buckets till this scan is in progress, maintain
> + * a pin on both of the buckets. Here, we have to be cautious about
>
> It wouldn't be a problem if VACUUM removed tuples from the new bucket,
> because they'd have to be dead anyway. It also wouldn't be a problem
> if it removed tuples from the old bucket that were actually dead. The
> real issue isn't vacuum anyway, but the process of cleaning up after a
> split. We need to hold the pin so that tuples being moved from the
> old bucket to the new bucket by the split don't get removed from the
> old bucket until our scan is done.
>

Updated the comments to explain this more clearly.

> + old_blkno = _hash_get_oldblock_from_newbucket(rel,
> opaque->hasho_bucket);
>
> Couldn't you pass "bucket" here instead of "hasho->opaque_bucket"? I
> feel like I'm repeating this ad nauseum, but I really think it's bad
> to rely on the special space instead of our own local variables!
>

Okay, changed as per suggestion.

> - /* we ran off the end of the bucket without finding a match */
> + /*
> + * We ran off the end of the bucket without finding a match.
> + * Release the pin on bucket buffers. Normally, such pins are
> + * released at end of scan, however scrolling cursors can
> + * reacquire the bucket lock and pin in the same scan multiple
> + * times.
> + */
> *bufP = so->hashso_curbuf = InvalidBuffer;
> ItemPointerSetInvalid(current);
> + _hash_dropscanbuf(rel, so);
>
> I think this comment is saying that we'll release the pin on the
> primary bucket page for now, and then reacquire it later if the user
> reverses the scan direction. But that doesn't sound very safe,
> because the bucket could be split in the meantime and the order in
> which tuples are returned could change. I think we want that to
> remain stable within a single query execution.
>

As explained in [1], this shouldn't be a problem.

> + _hash_readnext(rel, &buf, &page, &opaque,
> + (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);
>
> Same comment: don't rely on the special space to figure this out.
> Keep track. Also != 0 would be better than ? true : false.
>

After gluing the scans of the old and new buckets together in the
_hash_read* APIs, this is no longer required.
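
For reference, the gluing in _hash_readnext() now looks roughly like this
(a sketch only; field names such as hashso_buc_populated, hashso_buc_split
and hashso_split_bucket_buf follow the attached patch, but see the patch
for the exact code):

    blkno = (*opaquep)->hasho_nextblkno;
    /* ... release lock (and, for overflow pages, the pin) on the current page ... */

    if (BlockNumberIsValid(blkno))
    {
        /* common case: just move to the next overflow page of this bucket */
        *bufp = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
    }
    else if (so->hashso_buc_populated && !so->hashso_buc_split)
    {
        /*
         * We have run off the end of the bucket being populated; continue
         * the scan in the old bucket, whose primary page we already hold a
         * pin on, and remember that we switched.
         */
        *bufp = so->hashso_split_bucket_buf;
        LockBuffer(*bufp, BUFFER_LOCK_SHARE);
        so->hashso_buc_split = true;
    }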

> + /*
> + * setting hashso_skip_moved_tuples to false
> + * ensures that we don't check for tuples that are
> + * moved by split in old bucket and it also
> + * ensures that we won't retry to scan the old
> + * bucket once the scan for same is finished.
> + */
> + so->hashso_skip_moved_tuples = false;
>
> I think you've got a big problem here. Suppose the user starts the
> scan in the new bucket and runs it forward until they end up in the
> old bucket. Then they turn around and run the scan backward. When
> they reach the beginning of the old bucket, they're going to stop, not
> move back to the new bucket, AFAICS. Oops.
>
> _hash_first() has a related problem: a backward scan starts at the end
> of the new bucket and moves backward, but it should start at the end
> of the old bucket, and then when it reaches the beginning, flip to the
> new bucket and move backward through that one. Otherwise, a backward
> scan and a forward scan don't return tuples in opposite order, which
> they should.
>
> I think what you need to do to fix both of these problems is a more
> thorough job gluing the two buckets together. I'd suggest that the
> responsibility for switching between the two buckets should probably
> be given to _hash_readprev() and _hash_readnext(), because every place
> that needs to advance to the next or previous page that cares about
> this. Right now you are trying to handle it mostly in the functions
> that call those functions, but that is prone to errors of omission.
>

Changed the APIs as per this idea to fix both of these problems.
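
So the backward scan now behaves symmetrically: _hash_first() starts it at
the end of the old bucket's chain when a split is in progress, and
_hash_readprev() flips back to the new bucket once the beginning of the old
bucket is reached, roughly like this (again just an outline, field names as
above):

    else if (so->hashso_buc_populated && so->hashso_buc_split)
    {
        /*
         * We have reached the beginning of the old bucket while moving
         * backward; jump to the new bucket so that the backward scan
         * returns tuples in exactly the reverse of the forward order.
         */
        *bufp = so->hashso_bucket_buf;
        LockBuffer(*bufp, BUFFER_LOCK_SHARE);
        so->hashso_buc_split = false;
        /* ... then walk forward to the last page of the new bucket's chain ... */
    }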

> Also, I think that so->hashso_skip_moved_tuples is badly designed.
> There are two separate facts you need to know: (1) whether you are
> scanning a bucket that was still being populated at the start of your
> scan and (2) if yes, whether you are scanning the bucket being
> populated or whether you are instead scanning the corresponding "old"
> bucket. You're trying to keep track of that using one Boolean, but
> one Boolean only has two states and there are three possible states
> here.
>

The updated patch uses two boolean variables to track the bucket state.
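
Roughly, the scan-opaque state now carries the two facts separately, along
these lines (field names are illustrative; see the patch for the actual
definitions):

    typedef struct HashScanOpaqueData
    {
        /* ... existing fields ... */

        /*
         * Was the bucket we started the scan on still being populated by a
         * split at that time?
         */
        bool        hashso_buc_populated;

        /*
         * If so, are we currently scanning the old bucket being split
         * (true) or the new bucket being populated (false)?  Not consulted
         * when hashso_buc_populated is false.
         */
        bool        hashso_buc_split;
    } HashScanOpaqueData;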

> + if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
> + {
> +
> + /* release the lock on bucket buffer, before completing the split. */
>
> Extra blank line.
>

Removed.

> +moved-by-split flag on a tuple indicates that tuple is moved from old to new
> +bucket. The concurrent scans can skip such tuples till the split operation is
> +finished. Once the tuple is marked as moved-by-split, it will remain so forever
> +but that does no harm. We have intentionally not cleared it as that can generate
> +an additional I/O which is not necessary.
>
> The first sentence needs to start with "the" but the second sentence shouldn't.
>

Changed.

> It would be good to adjust this part a bit to more clearly explain
> that the split-in-progress and split-cleanup flags are bucket-level
> flags, while moved-by-split is a per-tuple flag. It's possible to
> figure this out from what you've written, but I think it could be more
> clear. Another thing that is strange is that the code uses THREE
> flags, bucket-being-split, bucket-being-populated, and
> needs-split-cleanup, but the README conflates the first two and uses a
> different name.
>

Updated the patch to use bucket-being-split and bucket-being-populated when
explaining the split operation in the README. I have also changed the README
to clearly indicate which flags are bucket-level and which are tuple-level.
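
For clarity, the distinction the README now spells out is between page-level
flags kept in the bucket's primary page and a per-tuple flag kept in the
index tuple itself, roughly (names as in the patch, bit values illustrative):

    /* bucket-level flags, in the primary bucket page's special space */
    #define LH_BUCKET_BEING_POPULATED       (1 << 4)    /* new bucket of an in-progress split */
    #define LH_BUCKET_BEING_SPLIT           (1 << 5)    /* old bucket of an in-progress split */
    #define LH_BUCKET_NEEDS_SPLIT_CLEANUP   (1 << 6)    /* old bucket awaiting cleanup */

    /* tuple-level flag, carried in the index tuple itself */
    #define INDEX_MOVED_BY_SPLIT_MASK       INDEX_AM_RESERVED_BIT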

> +previously-acquired content lock, but not pin and repeat the process using the
>
> s/but not pin/but not the pin,/
>

Changed.

> A problem is that if a split fails partway through (eg due to insufficient
> -disk space) the index is left corrupt. The probability of that could be
> -made quite low if we grab a free page or two before we update the meta
> -page, but the only real solution is to treat a split as a WAL-loggable,
> +disk space or crash) the index is left corrupt. The probability of that
> +could be made quite low if we grab a free page or two before we update the
> +meta page, but the only real solution is to treat a split as a WAL-loggable,
> must-complete action. I'm not planning to teach hash about WAL in this
> -go-round.
> +go-round. However, we do try to finish the incomplete splits during insert
> +and split.
>
> I think this paragraph needs a much heavier rewrite explaining the new
> incomplete split handling. It's basically wrong now. Perhaps replace
> it with something like this:
>
> --
> If a split fails partway through (e.g. due to insufficient disk space
> or an interrupt), the index will not be corrupted. Instead, we'll
> retry the split every time a tuple is inserted into the old bucket
> prior to inserting the new tuple; eventually, we should succeed. The
> fact that a split is left unfinished doesn't prevent subsequent
> buckets from being split, but we won't try to split the bucket again
> until the prior split is finished. In other words, a bucket can be in
> the middle of being split for some time, but ti can't be in the middle
> of two splits at the same time.
>
> Although we can survive a failure to split a bucket, a crash is likely
> to corrupt the index, since hash indexes are not yet WAL-logged.
> --
>

s/ti/it
Fixed the typo and used the suggested text in the README.

> + Acquire cleanup lock on target bucket
> + Scan and remove tuples
> + For overflow page, first we need to lock the next page and then
> + release the lock on current bucket or overflow page
> + Ensure to have buffer content lock in exclusive mode on bucket page
> + If buffer pincount is one, then compact free space as needed
> + Release lock
>
> I don't think this summary is particularly correct. You would never
> guess from this that we lock each bucket page in turn and then go back
> and try to relock the primary bucket page at the end. It's more like:
>
> acquire cleanup lock on primary bucket page
> loop:
> scan and remove tuples
> if this is the last bucket page, break out of loop
> pin and x-lock next page
> release prior lock and pin (except keep pin on primary bucket page)
> if the page we have locked is not the primary bucket page:
> release lock and take exclusive lock on primary bucket page
> if there are no other pins on the primary bucket page:
> squeeze the bucket to remove free space
>

Yeah, that is clearer, so I have used it in the README.
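
In C terms, that cleanup loop comes out roughly as follows (a sketch only;
see the bucket cleanup code in the patch for the real thing):

    buf = bucket_buf;       /* cleanup lock already held on the primary page */
    for (;;)
    {
        /* ... scan this page, remove dead and moved-by-split tuples ... */

        blkno = opaque->hasho_nextblkno;
        if (!BlockNumberIsValid(blkno))
            break;          /* that was the last page of the bucket */

        next_buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_OVERFLOW_PAGE);

        /* release the prior lock and pin, but keep the pin on the primary page */
        if (buf == bucket_buf)
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        else
            _hash_relbuf(rel, buf);

        buf = next_buf;
        opaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(buf));
    }

    if (buf != bucket_buf)
    {
        _hash_relbuf(rel, buf);
        LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);
    }

    if (IsBufferCleanupOK(bucket_buf))
    {
        /* no other pins on the primary bucket page: compact free space */
        _hash_squeezebucket(rel, bucket, bucket_blkno, ...);
    }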

> Come to think of it, I'm a little worried about the locking in
> _hash_squeezebucket(). It seems like we drop the lock on each "write"
> bucket page before taking the lock on the next one. So a concurrent
> scan could get ahead of the cleanup process. That would be bad,
> wouldn't it?
>

As discussed in [2], I have changed the code to use lock chaining during
the squeeze phase.
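
The essential change in the squeeze path is that the lock on the next
"write" page is taken before the lock on the current one is released, so a
concurrent scan following the bucket chain can never overtake the cleanup.
Roughly (sketch only):

    /* current write page is full: advance the write end of the chain */
    next_wblkno = wopaque->hasho_nextblkno;
    next_wbuf = _hash_getbuf(rel, next_wblkno, HASH_WRITE, LH_OVERFLOW_PAGE);

    /*
     * Only now let go of the previous write page (keeping the pin if it is
     * the primary bucket page).
     */
    if (wbuf == bucket_buf)
        LockBuffer(wbuf, BUFFER_LOCK_UNLOCK);
    else
        _hash_relbuf(rel, wbuf);

    wbuf = next_wbuf;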

Apart from the above, I have fixed a bug in the calculation of lowmask in
_hash_get_oldblock_from_newbucket().
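
For anyone following along, the invariant behind that calculation: a bucket
created by a split has exactly one more significant bit than the bucket it
was split from, so the old bucket number falls out of masking off the new
bucket's highest set bit. A standalone illustration of the arithmetic (not
the patch's code, just the idea):

    #include <stdio.h>
    #include <stdint.h>

    /* index of the highest set bit (caller must ensure x != 0) */
    static int
    highbit(uint32_t x)
    {
        int     i = -1;

        while (x)
        {
            x >>= 1;
            i++;
        }
        return i;
    }

    int
    main(void)
    {
        uint32_t    new_bucket = 5;     /* binary 101 */
        uint32_t    lowmask = (((uint32_t) 1) << highbit(new_bucket)) - 1;
        uint32_t    old_bucket = new_bucket & lowmask;

        /* prints: lowmask = 0x3, old bucket = 1 (bucket 5 was split from bucket 1) */
        printf("lowmask = 0x%x, old bucket = %u\n", lowmask, old_bucket);
        return 0;
    }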

[1] - https://www.postgresql.org/message-id/CAA4eK1JJDWFY0_Ezs4ZxXgnrGtTn48vFuXniOLmL7FOWX-tKNw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1J%2B0OYWKswWYNEjrBk3LfGpGJ9iSV8bYPQ3M%3D-qpkMtwQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
concurrent_hash_index_v12.patch application/octet-stream 105.6 KB
