Re: Hash Indexes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Indexes
Date: 2016-11-17 17:24:17
Message-ID: CA+TgmoZObCrc5A3nw-CXp9BJ6uAr1mXHY8ZieijubQSr3eQewA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 17, 2016 at 12:05 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> Are you expecting a comment change here?
>
>> + old_blkno = _hash_get_oldblock_from_newbucket(rel,
>> opaque->hasho_bucket);
>>
>> Couldn't you pass "bucket" here instead of "opaque->hasho_bucket"? I
>> feel like I'm repeating this ad nauseam, but I really think it's bad
>> to rely on the special space instead of our own local variables!
>>
>
> Sure, we can pass bucket as well. However, if you see a few lines
> below (while (BlockNumberIsValid(opaque->hasho_nextblkno))), we are
> already relying on the special space to pass variables. In general, we
> use the special space to pass variables to functions in many other
> places in the code. What exactly bothers you about accessing the
> special space, if it is safe to do so?

I don't want to rely on the special space to know which buffers we
have locked or pinned. We obviously need the special space to find
the next and previous buffers in the bucket chain -- there's no other
way to know that. However, we should be more careful about locks and
pins. Even if the special space is corrupted in some way, we still
shouldn't get confused about which buffers we have locked or pinned.
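To make that concrete, the shape I have in mind is something like this
(just a sketch with assumed surrounding context, not the patch's actual
code): capture the bucket number from the special space once, and pass
the local copy around thereafter:

    Bucket      bucket = opaque->hasho_bucket;  /* read special space once */

    /* ... later, pass the local copy, not a fresh read of the page ... */
    old_blkno = _hash_get_oldblock_from_newbucket(rel, bucket);

That way, even if the page contents change underneath us, our notion of
which bucket we're operating on stays consistent.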

>> I think this comment is saying that we'll release the pin on the
>> primary bucket page for now, and then reacquire it later if the user
>> reverses the scan direction. But that doesn't sound very safe,
>> because the bucket could be split in the meantime and the order in
>> which tuples are returned could change. I think we want that to
>> remain stable within a single query execution.
>
> Isn't that possible even without the patch? Basically, after reaching
> the end of a forward scan, to do a backward *all* scan we need to
> perform a portal rewind, which in turn calls hashrescan, where we drop
> the lock on the bucket; then, when we try to move the cursor forward
> again, we acquire the lock in _hash_first(). So in the window where we
> don't hold the lock, a split could happen and the next scan's results
> could differ.

Well, the existing code doesn't drop the heavyweight lock at that
location, but your patch does drop the pin that serves the same
function, so I feel like there must be some difference.

>> Also, I think that so->hashso_skip_moved_tuples is badly designed.
>> There are two separate facts you need to know: (1) whether you are
>> scanning a bucket that was still being populated at the start of your
>> scan and (2) if yes, whether you are scanning the bucket being
>> populated or whether you are instead scanning the corresponding "old"
>> bucket. You're trying to keep track of that using one Boolean, but
>> one Boolean only has two states and there are three possible states
>> here.
>
> So do you prefer to have two booleans to track those facts, or a
> uint8 with a tri-state value, or something else?

I don't currently have a preference.
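But just to illustrate the kind of thing I mean (purely a sketch with
invented names; I'm not proposing these exact identifiers):

    /* Hypothetical tri-state replacing so->hashso_skip_moved_tuples: */
    typedef enum
    {
        HASHSO_BUCKET_NORMAL,        /* bucket was not being populated
                                      * at the start of the scan */
        HASHSO_SCANNING_NEW_BUCKET,  /* scanning the bucket being populated */
        HASHSO_SCANNING_OLD_BUCKET   /* scanning the corresponding "old" bucket */
    } HashScanBucketState;

Two booleans would work just as well; the point is only that all three
states need to be distinguishable.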

>> Come to think of it, I'm a little worried about the locking in
>> _hash_squeezebucket(). It seems like we drop the lock on each "write"
>> bucket page before taking the lock on the next one. So a concurrent
>> scan could get ahead of the cleanup process. That would be bad,
>> wouldn't it?
>
> Yeah, that would be bad if it happened, but no concurrent scan can
> happen during the squeeze phase, because we take an exclusive lock on
> the bucket page and maintain it throughout the operation.

Well, that's completely unacceptable. A major reason the current code
uses heavyweight locks is that you can't hold lightweight locks across
arbitrary amounts of work -- just to take one example, a process
holding or waiting for an LWLock isn't interruptible. The point of
this redesign was to get rid of that, so that LWLocks are only held
for short periods. I dislike the lock-chaining approach (take the
next lock before releasing the previous one) quite a bit and would
really like to find a way to get rid of it, but the idea of holding a
buffer lock across a complete traversal of an unbounded number of
overflow pages is far worse. We've got to come up with a design that
doesn't require that, or else completely redesign the bucket-squeezing
stuff.
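For reference, the chaining pattern I'm referring to looks roughly like
this, using the generic buffer-manager calls; it's a sketch only, with
an assumed next_block() helper standing in for reading hasho_nextblkno
from the special space, not the patch's actual code:

    Buffer      cur = ReadBuffer(rel, blkno);
    BlockNumber next_blkno;

    LockBuffer(cur, BUFFER_LOCK_EXCLUSIVE);
    while (BlockNumberIsValid(next_blkno = next_block(cur)))
    {
        Buffer      next = ReadBuffer(rel, next_blkno);

        LockBuffer(next, BUFFER_LOCK_EXCLUSIVE);    /* take the next lock first... */
        UnlockReleaseBuffer(cur);                   /* ...then release the previous */
        cur = next;
    }
    UnlockReleaseBuffer(cur);

At most two buffer locks are held at any instant, which is tolerable;
one lock held across the whole chain is not.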

(Would it make any sense to change the order of the hash index patches
we've got outstanding? For instance, if we did the page-at-a-time
stuff first, it would make life simpler for this patch in several
ways, possibly including this issue.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
