Re: Hash Indexes

From: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Indexes
Date: 2016-11-24 07:49:07
Message-ID: CAE9k0PnXBGwPKXxG+hF8dXR-qotFp-yk00thiHxG7h_QsF=oHA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi All,

I have executed few test-cases to validate the v12 patch for
concurrent hash index shared upthread and have found no issues. Below
are some of the test-cases i ran,

1) pgbench test on a read-write workload with following configuration
(This was basically to validate the locking strategy not for
performance testing)

postgresql non-default configuration:
----------------------------------------------------
min_wal_size=15GB
max_wal_size=20GB
checkpoint_timeout=900
maintenance_work_mem=1GB
checkpoint_completion_target=0.9
max_connections=200
shared buffer=8GB

pgbench settings:
-------------------------
Scale Factor=300
run time= 30 mins
pgbench -c $thread -j $thread -T $time_for_reading -M prepared postgres

2) As v12 patch mainly has locking changes related to bucket squeezing
in hash index, I have ran a small test-case to build hash index with
good number of overflow pages and then ran deletion operation to see
if the bucket squeezing has happened. The test script
"test_squeezeb_hindex.sh" used for this testing is attached with this
mail and the results are shown below:

=====Number of bucket and overflow pages before delete=====
274671 Tuples only is on.
148390
131263 bucket
17126 overflow
1 bitmap

=====Number of bucket and overflow pages after delete=====
274671 Tuples only is on.
141240
131263 bucket
9976 overflow
1 bitmap

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, Nov 23, 2016 at 7:20 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Thu, Nov 17, 2016 at 3:08 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>> You are right and I have changed the code as per your suggestion.
>>
>> So...
>>
>> + /*
>> + * We always maintain the pin on bucket page for whole scan operation,
>> + * so releasing the additional pin we have acquired here.
>> + */
>> + if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
>> + _hash_dropbuf(rel, *bufp);
>>
>> This relies on the page contents to know whether we took a pin; that
>> seems like a bad plan. We need to know intrinsically whether we took
>> a pin.
>>
>
> Okay, changed to not rely on page contents.
>
>> + * If the bucket split is in progress, then we need to skip tuples that
>> + * are moved from old bucket. To ensure that vacuum doesn't clean any
>> + * tuples from old or new buckets till this scan is in progress, maintain
>> + * a pin on both of the buckets. Here, we have to be cautious about
>>
>> It wouldn't be a problem if VACUUM removed tuples from the new bucket,
>> because they'd have to be dead anyway. It also wouldn't be a problem
>> if it removed tuples from the old bucket that were actually dead. The
>> real issue isn't vacuum anyway, but the process of cleaning up after a
>> split. We need to hold the pin so that tuples being moved from the
>> old bucket to the new bucket by the split don't get removed from the
>> old bucket until our scan is done.
>>
>
> Updated comments to explain clearly.
>
>> + old_blkno = _hash_get_oldblock_from_newbucket(rel,
>> opaque->hasho_bucket);
>>
>> Couldn't you pass "bucket" here instead of "hasho->opaque_bucket"? I
>> feel like I'm repeating this ad nauseum, but I really think it's bad
>> to rely on the special space instead of our own local variables!
>>
>
> Okay, changed as per suggestion.
>
>> - /* we ran off the end of the bucket without finding a match */
>> + /*
>> + * We ran off the end of the bucket without finding a match.
>> + * Release the pin on bucket buffers. Normally, such pins are
>> + * released at end of scan, however scrolling cursors can
>> + * reacquire the bucket lock and pin in the same scan multiple
>> + * times.
>> + */
>> *bufP = so->hashso_curbuf = InvalidBuffer;
>> ItemPointerSetInvalid(current);
>> + _hash_dropscanbuf(rel, so);
>>
>> I think this comment is saying that we'll release the pin on the
>> primary bucket page for now, and then reacquire it later if the user
>> reverses the scan direction. But that doesn't sound very safe,
>> because the bucket could be split in the meantime and the order in
>> which tuples are returned could change. I think we want that to
>> remain stable within a single query execution.
>>
>
> As explained [1], this shouldn't be a problem.
>
>> + _hash_readnext(rel, &buf, &page, &opaque,
>> + (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);
>>
>> Same comment: don't rely on the special space to figure this out.
>> Keep track. Also != 0 would be better than ? true : false.
>>
>
> After gluing scan of old and new buckets in _hash_read* API's, this is
> no more required.
>
>> + /*
>> + * setting hashso_skip_moved_tuples to false
>> + * ensures that we don't check for tuples that are
>> + * moved by split in old bucket and it also
>> + * ensures that we won't retry to scan the old
>> + * bucket once the scan for same is finished.
>> + */
>> + so->hashso_skip_moved_tuples = false;
>>
>> I think you've got a big problem here. Suppose the user starts the
>> scan in the new bucket and runs it forward until they end up in the
>> old bucket. Then they turn around and run the scan backward. When
>> they reach the beginning of the old bucket, they're going to stop, not
>> move back to the new bucket, AFAICS. Oops.
>>
>> _hash_first() has a related problem: a backward scan starts at the end
>> of the new bucket and moves backward, but it should start at the end
>> of the old bucket, and then when it reaches the beginning, flip to the
>> new bucket and move backward through that one. Otherwise, a backward
>> scan and a forward scan don't return tuples in opposite order, which
>> they should.
>>
>> I think what you need to do to fix both of these problems is a more
>> thorough job gluing the two buckets together. I'd suggest that the
>> responsibility for switching between the two buckets should probably
>> be given to _hash_readprev() and _hash_readnext(), because every place
>> that needs to advance to the next or previous page that cares about
>> this. Right now you are trying to handle it mostly in the functions
>> that call those functions, but that is prone to errors of omission.
>>
>
> Changed as per this idea to change the API's and fix the problem.
>
>> Also, I think that so->hashso_skip_moved_tuples is badly designed.
>> There are two separate facts you need to know: (1) whether you are
>> scanning a bucket that was still being populated at the start of your
>> scan and (2) if yes, whether you are scanning the bucket being
>> populated or whether you are instead scanning the corresponding "old"
>> bucket. You're trying to keep track of that using one Boolean, but
>> one Boolean only has two states and there are three possible states
>> here.
>>
>
> Updated patch is using two boolean variables to track the bucket state.
>
>> + if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
>> + {
>> +
>> + /* release the lock on bucket buffer, before completing the split. */
>>
>> Extra blank line.
>>
>
> Removed.
>
>> +moved-by-split flag on a tuple indicates that tuple is moved from old to new
>> +bucket. The concurrent scans can skip such tuples till the split operation is
>> +finished. Once the tuple is marked as moved-by-split, it will remain
>> so forever
>> +but that does no harm. We have intentionally not cleared it as that
>> can generate
>> +an additional I/O which is not necessary.
>>
>> The first sentence needs to start with "the" but the second sentence shouldn't.
>>
>
> Changed.
>
>> It would be good to adjust this part a bit to more clearly explain
>> that the split-in-progress and split-cleanup flags are bucket-level
>> flags, while moved-by-split is a per-tuple flag. It's possible to
>> figure this out from what you've written, but I think it could be more
>> clear. Another thing that is strange is that the code uses THREE
>> flags, bucket-being-split, bucket-being-populated, and
>> needs-split-cleanup, but the README conflates the first two and uses a
>> different name.
>>
>
> Updated patch to use bucket-being-split and bucket-being-populated to
> explain the split operation in README. I have also changed the readme
> to clearly indicate which the bucket and tuple level flags.
>
>> +previously-acquired content lock, but not pin and repeat the process using the
>>
>> s/but not pin/but not the pin,/
>>
>
> Changed.
>
>> A problem is that if a split fails partway through (eg due to insufficient
>> -disk space) the index is left corrupt. The probability of that could be
>> -made quite low if we grab a free page or two before we update the meta
>> -page, but the only real solution is to treat a split as a WAL-loggable,
>> +disk space or crash) the index is left corrupt. The probability of that
>> +could be made quite low if we grab a free page or two before we update the
>> +meta page, but the only real solution is to treat a split as a WAL-loggable,
>> must-complete action. I'm not planning to teach hash about WAL in this
>> -go-round.
>> +go-round. However, we do try to finish the incomplete splits during insert
>> +and split.
>>
>> I think this paragraph needs a much heavier rewrite explaining the new
>> incomplete split handling. It's basically wrong now. Perhaps replace
>> it with something like this:
>>
>> --
>> If a split fails partway through (e.g. due to insufficient disk space
>> or an interrupt), the index will not be corrupted. Instead, we'll
>> retry the split every time a tuple is inserted into the old bucket
>> prior to inserting the new tuple; eventually, we should succeed. The
>> fact that a split is left unfinished doesn't prevent subsequent
>> buckets from being split, but we won't try to split the bucket again
>> until the prior split is finished. In other words, a bucket can be in
>> the middle of being split for some time, but ti can't be in the middle
>> of two splits at the same time.
>>
>> Although we can survive a failure to split a bucket, a crash is likely
>> to corrupt the index, since hash indexes are not yet WAL-logged.
>> --
>>
>
> s/ti/it
> Fixed the typo and used the suggested text in README.
>
>> + Acquire cleanup lock on target bucket
>> + Scan and remove tuples
>> + For overflow page, first we need to lock the next page and then
>> + release the lock on current bucket or overflow page
>> + Ensure to have buffer content lock in exclusive mode on bucket page
>> + If buffer pincount is one, then compact free space as needed
>> + Release lock
>>
>> I don't think this summary is particularly correct. You would never
>> guess from this that we lock each bucket page in turn and then go back
>> and try to relock the primary bucket page at the end. It's more like:
>>
>> acquire cleanup lock on primary bucket page
>> loop:
>> scan and remove tuples
>> if this is the last bucket page, break out of loop
>> pin and x-lock next page
>> release prior lock and pin (except keep pin on primary bucket page)
>> if the page we have locked is not the primary bucket page:
>> release lock and take exclusive lock on primary bucket page
>> if there are no other pins on the primary bucket page:
>> squeeze the bucket to remove free space
>>
>
> Yeah, it is clear, so I have used it in README.
>
>> Come to think of it, I'm a little worried about the locking in
>> _hash_squeezebucket(). It seems like we drop the lock on each "write"
>> bucket page before taking the lock on the next one. So a concurrent
>> scan could get ahead of the cleanup process. That would be bad,
>> wouldn't it?
>>
>
> As discussed [2], I have changed the code to use lock-chaining during
> squeeze phase.
>
>
> Apart from above, I have fixed a bug in calculation of lowmask in
> _hash_get_oldblock_from_newbucket().
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1JJDWFY0_Ezs4ZxXgnrGtTn48vFuXniOLmL7FOWX-tKNw%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/CAA4eK1J%2B0OYWKswWYNEjrBk3LfGpGJ9iSV8bYPQ3M%3D-qpkMtwQ
> %40mail.gmail.com
>
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

Attachment Content-Type Size
test_squeeze_hindex.sh application/x-sh 1.0 KB
test_squeeze_hindex.sql application/sql 851.3 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2016-11-24 07:57:20 Re: Push down more full joins in postgres_fdw
Previous Message Michael Paquier 2016-11-24 07:46:19 Re: [PATCH] Reload SSL certificates on SIGHUP