Re: Hash Indexes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Indexes
Date: 2016-11-16 21:38:30
Message-ID: CA+Tgmoac-FjFRoHRABOyfYf1t1fQHHZ84=UfyHk_bESouZw=PA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> You are right and I have changed the code as per your suggestion.

So...

+ /*
+ * We always maintain the pin on bucket page for whole scan operation,
+ * so releasing the additional pin we have acquired here.
+ */
+ if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+ _hash_dropbuf(rel, *bufp);

This relies on the page contents to know whether we took a pin; that
seems like a bad plan. We need to know intrinsically whether we took
a pin.
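
For example, the caller could drive this off state the scan code keeps
itself, along these lines (a rough sketch only; hashso_bucket_buf is an
illustrative name for a remembered primary-bucket buffer, not necessarily
what the patch uses):

    /*
     * We pinned the primary bucket page at the start of the scan and hold
     * that pin for the scan's whole lifetime, so if the buffer we just
     * read is that same buffer, give back the extra pin acquired here.
     */
    if (*bufp == so->hashso_bucket_buf)
        _hash_dropbuf(rel, *bufp);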

+ * If the bucket split is in progress, then we need to skip tuples that
+ * are moved from old bucket. To ensure that vacuum doesn't clean any
+ * tuples from old or new buckets till this scan is in progress, maintain
+ * a pin on both of the buckets. Here, we have to be cautious about

It wouldn't be a problem if VACUUM removed tuples from the new bucket,
because they'd have to be dead anyway. It also wouldn't be a problem
if it removed tuples from the old bucket that were actually dead. The
real issue isn't vacuum anyway, but the process of cleaning up after a
split. We need to hold the pin so that tuples being moved from the
old bucket to the new bucket by the split don't get removed from the
old bucket until our scan is done.

+ old_blkno = _hash_get_oldblock_from_newbucket(rel,
opaque->hasho_bucket);

Couldn't you pass "bucket" here instead of "opaque->hasho_bucket"? I
feel like I'm repeating this ad nauseam, but I really think it's bad
to rely on the special space instead of our own local variables!
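
That is, assuming the surrounding function's local variable is still
called "bucket", just:

    old_blkno = _hash_get_oldblock_from_newbucket(rel, bucket);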

- /* we ran off the end of the bucket without finding a match */
+ /*
+ * We ran off the end of the bucket without finding a match.
+ * Release the pin on bucket buffers. Normally, such pins are
+ * released at end of scan, however scrolling cursors can
+ * reacquire the bucket lock and pin in the same scan multiple
+ * times.
+ */
*bufP = so->hashso_curbuf = InvalidBuffer;
ItemPointerSetInvalid(current);
+ _hash_dropscanbuf(rel, so);

I think this comment is saying that we'll release the pin on the
primary bucket page for now, and then reacquire it later if the user
reverses the scan direction. But that doesn't sound very safe,
because the bucket could be split in the meantime and the order in
which tuples are returned could change. I think we want that to
remain stable within a single query execution.

+ _hash_readnext(rel, &buf, &page, &opaque,
+ (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);

Same comment: don't rely on the special space to figure this out.
Keep track. Also != 0 would be better than ? true : false.
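
The mechanical fix for the second point would simply be:

    _hash_readnext(rel, &buf, &page, &opaque,
                   (opaque->hasho_flag & LH_BUCKET_PAGE) != 0);

but, per the first point, what should really be passed here is a flag the
caller tracks itself rather than something re-read from the special space.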

+ /*
+ * setting hashso_skip_moved_tuples to false
+ * ensures that we don't check for tuples that are
+ * moved by split in old bucket and it also
+ * ensures that we won't retry to scan the old
+ * bucket once the scan for same is finished.
+ */
+ so->hashso_skip_moved_tuples = false;

I think you've got a big problem here. Suppose the user starts the
scan in the new bucket and runs it forward until they end up in the
old bucket. Then they turn around and run the scan backward. When
they reach the beginning of the old bucket, they're going to stop, not
move back to the new bucket, AFAICS. Oops.

_hash_first() has a related problem: a backward scan starts at the end
of the new bucket and moves backward, but it should start at the end
of the old bucket, and then when it reaches the beginning, flip to the
new bucket and move backward through that one. Otherwise, a backward
scan and a forward scan don't return tuples in opposite order, which
they should.

I think what you need to do to fix both of these problems is a more
thorough job gluing the two buckets together. I'd suggest that the
responsibility for switching between the two buckets should probably
be given to _hash_readprev() and _hash_readnext(), because every place
that needs to advance to the next or previous page cares about this.
Right now you are trying to handle it mostly in the functions
that call those functions, but that is prone to errors of omission.
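
Sketched very roughly, and using the pair of scan-state flags suggested
below (all hashso_* names here are illustrative, not the patch's),
_hash_readnext() could do the switch itself when it runs off the end of
the chain:

    blkno = (*opaquep)->hasho_nextblkno;

    /* ... release the lock on the current page, keeping bucket pins ... */

    if (BlockNumberIsValid(blkno))
    {
        /* ordinary case: advance to the next overflow page */
        *bufp = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
    }
    else if (so->hashso_buc_populated && !so->hashso_buc_split)
    {
        /*
         * We ran off the end of the bucket being populated while a split
         * was in progress: continue the scan in the old bucket, whose
         * primary page we already hold a pin on, and remember that we
         * have switched.
         */
        *bufp = so->hashso_old_bucket_buf;      /* illustrative name */
        LockBuffer(*bufp, BUFFER_LOCK_SHARE);
        so->hashso_buc_split = true;
    }
    /* otherwise we really have reached the end of the scan */

_hash_readprev() would then do the mirror image, flipping back to the
bucket being populated when a backward scan reaches the start of the old
bucket.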

Also, I think that so->hashso_skip_moved_tuples is badly designed.
There are two separate facts you need to know: (1) whether you are
scanning a bucket that was still being populated at the start of your
scan and (2) if yes, whether you are scanning the bucket being
populated or whether you are instead scanning the corresponding "old"
bucket. You're trying to keep track of that using one Boolean, but
one Boolean only has two states and there are three possible states
here.
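
Concretely, that could be two flags in the scan opaque data, something
like this (field names invented for illustration):

    /*
     * True if, at the start of the scan, the bucket we hashed to was
     * still being populated by an unfinished split.  If false, the flag
     * below is meaningless.
     */
    bool        hashso_buc_populated;

    /*
     * Valid only when hashso_buc_populated is true: false while we are
     * scanning the bucket being populated, true once we have switched
     * over to scanning the corresponding old bucket.
     */
    bool        hashso_buc_split;

That gives exactly the three legal states: no split in progress, scanning
the new bucket of an in-progress split, and scanning the old bucket of an
in-progress split.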

+ if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
+ {
+
+ /* release the lock on bucket buffer, before completing the split. */

Extra blank line.

+moved-by-split flag on a tuple indicates that tuple is moved from old to new
+bucket. The concurrent scans can skip such tuples till the split operation is
+finished. Once the tuple is marked as moved-by-split, it will remain so
+forever but that does no harm. We have intentionally not cleared it as that
+can generate an additional I/O which is not necessary.

The first sentence needs to start with "the" but the second sentence shouldn't.

It would be good to adjust this part a bit to more clearly explain
that the split-in-progress and split-cleanup flags are bucket-level
flags, while moved-by-split is a per-tuple flag. It's possible to
figure this out from what you've written, but I think it could be more
clear. Another thing that is strange is that the code uses THREE
flags, bucket-being-split, bucket-being-populated, and
needs-split-cleanup, but the README conflates the first two and uses a
different name.

+previously-acquired content lock, but not pin and repeat the process using the

s/but not pin/but not the pin,/

A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt. The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or crash) the index is left corrupt. The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
must-complete action. I'm not planning to teach hash about WAL in this
-go-round.
+go-round. However, we do try to finish the incomplete splits during insert
+and split.

I think this paragraph needs a much heavier rewrite explaining the new
incomplete split handling. It's basically wrong now. Perhaps replace
it with something like this:

--
If a split fails partway through (e.g. due to insufficient disk space
or an interrupt), the index will not be corrupted. Instead, we'll
retry the split every time a tuple is inserted into the old bucket
prior to inserting the new tuple; eventually, we should succeed. The
fact that a split is left unfinished doesn't prevent subsequent
buckets from being split, but we won't try to split the bucket again
until the prior split is finished. In other words, a bucket can be in
the middle of being split for some time, but it can't be in the middle
of two splits at the same time.

Although we can survive a failure to split a bucket, a crash is likely
to corrupt the index, since hash indexes are not yet WAL-logged.
--

+ Acquire cleanup lock on target bucket
+ Scan and remove tuples
+ For overflow page, first we need to lock the next page and then
+ release the lock on current bucket or overflow page
+ Ensure to have buffer content lock in exclusive mode on bucket page
+ If buffer pincount is one, then compact free space as needed
+ Release lock

I don't think this summary is particularly correct. You would never
guess from this that we lock each bucket page in turn and then go back
and try to relock the primary bucket page at the end. It's more like:

acquire cleanup lock on primary bucket page
loop:
    scan and remove tuples
    if this is the last bucket page, break out of loop
    pin and x-lock next page
    release prior lock and pin (except keep pin on primary bucket page)
if the page we have locked is not the primary bucket page:
    release lock and take exclusive lock on primary bucket page
if there are no other pins on the primary bucket page:
    squeeze the bucket to remove free space
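
In buffer-manager terms that might look roughly like this (only a sketch:
rel, bucket_buf, bucket_blkno, and bstrategy come from the caller, the
tuple-removal and squeeze steps are elided, and IsBufferCleanupOK is the
test the patch already uses elsewhere):

    Buffer      buf = bucket_buf;   /* cleanup lock already held here */
    BlockNumber blkno = bucket_blkno;

    for (;;)
    {
        Page            page = BufferGetPage(buf);
        HashPageOpaque  opaque = (HashPageOpaque) PageGetSpecialPointer(page);
        Buffer          next_buf;

        /* ... scan "page" and remove whatever tuples we may remove ... */

        blkno = opaque->hasho_nextblkno;
        if (!BlockNumberIsValid(blkno))
            break;                  /* last page of the bucket chain */

        /* pin and exclusive-lock the next page first */
        next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
                                              LH_OVERFLOW_PAGE, bstrategy);

        /* then release the prior lock, keeping only the primary page's pin */
        if (buf == bucket_buf)
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        else
            _hash_relbuf(rel, buf);

        buf = next_buf;
    }

    /* go back and re-lock the primary bucket page */
    if (buf != bucket_buf)
    {
        _hash_relbuf(rel, buf);
        LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);
    }

    /* squeeze only if nobody else has the primary bucket page pinned */
    if (IsBufferCleanupOK(bucket_buf))
    {
        /* ... squeeze the bucket to compact free space ... */
    }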

Come to think of it, I'm a little worried about the locking in
_hash_squeezebucket(). It seems like we drop the lock on each "write"
bucket page before taking the lock on the next one. So a concurrent
scan could get ahead of the cleanup process. That would be bad,
wouldn't it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
