Re: Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Indexes
Date: 2016-10-24 14:30:28
Message-ID: CAA4eK1J57GJbKU4TU5HQCg4fdestfOmNmWXeGiKPS=Spe4v3EA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Amit, can you please split the buffer manager changes in this patch
> into a separate patch?
>

Sure, the attached patch extend_bufmgr_api_for_hash_index_v1.patch does that.

> I think those changes can be committed first
> and then we can try to deal with the rest of it. Instead of adding
> ConditionalLockBufferShared, I think we should add an "int mode"
> argument to the existing ConditionalLockBuffer() function. That way
> is more consistent with LockBuffer(). It means an API break for any
> third-party code that's calling this function, but that doesn't seem
> like a big problem.

That was the reason I had chosen to write a separate API, but I have
now changed it as per your suggestion.
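
To make that concrete, here is a minimal sketch of the revised
function (the attached patch is authoritative; mode takes the same
values as for LockBuffer()):

bool
ConditionalLockBuffer(Buffer buffer, int mode)
{
    BufferDesc *buf;

    Assert(BufferIsValid(buffer));
    if (BufferIsLocal(buffer))
        return true;            /* act as though we got it */

    buf = GetBufferDescriptor(buffer - 1);

    /* mode is BUFFER_LOCK_SHARE or BUFFER_LOCK_EXCLUSIVE, as for LockBuffer() */
    return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
                                    mode == BUFFER_LOCK_SHARE ?
                                    LW_SHARED : LW_EXCLUSIVE);
}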

> As for CheckBufferForCleanup, I think that looks OK, but: (1) please
> add an Assert() that we hold an exclusive lock on the buffer, using
> LWLockHeldByMeInMode; and (2) I think we should rename it to something
> like IsBufferCleanupOK. Then, when it's used, it reads like English:
> if (IsBufferCleanupOK(buf)) { /* clean up the buffer */ }.

Changed as per suggestion.
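
For reference, a simplified sketch of the renamed function with the
suggested Assert (local-buffer handling omitted here; the attached
patch is authoritative):

bool
IsBufferCleanupOK(Buffer buffer)
{
    BufferDesc *bufHdr;
    uint32      buf_state;

    Assert(BufferIsValid(buffer));

    /* there should be exactly one pin held by this backend */
    if (GetPrivateRefCount(buffer) != 1)
        return false;

    bufHdr = GetBufferDescriptor(buffer - 1);

    /* caller must already hold the buffer exclusively locked */
    Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
                                LW_EXCLUSIVE));

    buf_state = LockBufHdr(bufHdr);
    Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);

    if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
    {
        /* pin count is 1, i.e. ours is the only pin: cleanup is OK */
        UnlockBufHdr(bufHdr, buf_state);
        return true;
    }

    UnlockBufHdr(bufHdr, buf_state);
    return false;
}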

>> I'll write another email with my thoughts about the rest of the patch.
>
> I think that the README changes for this patch need a fairly large
> amount of additional work. Here are a few things I notice:
>
> - The confusion between buckets and pages hasn't been completely
> cleared up. If you read the beginning of the README, the terminology
> is clearly set forth. It says:
>
>>> A hash index consists of two or more "buckets", into which tuples are placed whenever their hash key maps to the bucket number. Each bucket in the hash index comprises one or more index pages. The bucket's first page is permanently assigned to it when the bucket is created. Additional pages, called "overflow pages", are added if the bucket receives too many tuples to fit in the primary bucket page."
>
> But later on, you say:
>
>>> Scan will take a lock in shared mode on the primary bucket or on one of the overflow page.
>
> So the correct terminology here would be "primary bucket page" not
> "primary bucket".
>
> - In addition, notice that there are two English errors in this
> sentence: the word "the" needs to be added to the beginning of the
> sentence, and the last word needs to be "pages" rather than "page".
> There are a considerable number of similar minor errors; if you can't
> fix them, I'll make a pass over it and clean it up.
>

I have tried to fix it as per the above suggestion, but I think some
more work may still be needed.

> - The whole "lock definitions" section seems to me to be pretty loose
> and imprecise about what is happening. For example, it uses the term
> "split-in-progress" without first defining it. The sentence quoted
> above says that scans take a lock in shared mode either on the primary
> page or on one of the overflow pages, but it's no good to document code by
> saying that it will do either A or B without explaining which one! In
> fact, I think that a scan will take a content lock first on the
> primary bucket page and then on each overflow page in sequence,
> retaining a pin on the primary buffer page throughout the scan. So it
> is not one or the other but both in a particular sequence, and that
> can and should be explained.
>
> Another problem with this section is that even when it's precise about
> what is going on, it's probably duplicating what is or should be in
> the following sections where the algorithms for each operation are
> explained. In the original wording, this section explains what each
> lock protects, and then the following sections explain the algorithms
> in the context of those definitions. Now, this section contains a
> sketch of the algorithm, and then the following sections lay it out
> again in more detail. The question of what each lock protects has
> been lost. Here's an attempt at some text to replace what you have
> here:
>
> ===
> Concurrency control for hash indexes is provided using buffer content
> locks, buffer pins, and cleanup locks. Here, as elsewhere in
> PostgreSQL, cleanup lock means that we hold an exclusive lock on the
> buffer and have observed at some point after acquiring the lock that
> we hold the only pin on that buffer. For hash indexes, a cleanup lock
> on a primary bucket page represents the right to perform an arbitrary
> reorganization of the entire bucket, while a cleanup lock on an
> overflow page represents the right to perform a reorganization of just
> that page. Therefore, scans retain a pin on both the primary bucket
> page and the overflow page they are currently scanning, if any.
> Splitting a bucket requires a cleanup lock on both the old and new
> primary bucket pages. VACUUM therefore takes a cleanup lock on every
> bucket page in turn in order to remove tuples. It can also remove tuples
> copied to a new bucket by any previous split operation, because the
> cleanup lock taken on the primary bucket page guarantees that no scans
> which started prior to the most recent split can still be in progress.
> After cleaning each page individually, it attempts to take a cleanup
> lock on the primary bucket page in order to "squeeze" the bucket down
> to the minimum possible number of pages.
> ===
>

Changed as per suggestion.
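
As an illustration of the convention described in that text (this is
just a sketch against the existing bufmgr API, not code from the
patch), a caller that wants the right to reorganize a page would do
something like:

    Buffer      buf = ReadBuffer(rel, blkno);

    if (ConditionalLockBufferForCleanup(buf))
    {
        /*
         * We hold an exclusive content lock and the only pin on the
         * page, so no scan can be stopped partway through it; an
         * arbitrary reorganization (e.g. the "squeeze" step) is safe.
         */
        /* ... reorganize the page here ... */
        UnlockReleaseBuffer(buf);
    }
    else
    {
        /* someone else holds a pin; skip or retry instead of blocking */
        ReleaseBuffer(buf);
    }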

> As I was looking at the old text regarding deadlock risk, I realized
> what may be a serious problem. Suppose process A is performing a scan
> of some hash index. While the scan is suspended, it attempts to take
> a lock and is blocked by process B. Process B, meanwhile, is running
> VACUUM on that hash index. Eventually, it will do
> LockBufferForCleanup() on the hash bucket on which process A holds a
> buffer pin, resulting in an undetected deadlock. In the current
> coding, A would hold a heavyweight lock and B would attempt to acquire
> a conflicting heavyweight lock, and the deadlock detector would kill
> one of them. This patch probably breaks that. I notice that that's
> the only place where we attempt to acquire a buffer cleanup lock
> unconditionally; every place else, we acquire the lock conditionally,
> so there's no deadlock risk. Once we resolve this problem, the
> paragraph about deadlock risk in this section should be revised to
> explain whatever solution we come up with.
>
> By the way, since VACUUM must run in its own transaction, B can't be
> holding arbitrary locks, but that doesn't seem quite sufficient to get
> us out of the woods. It will at least hold ShareUpdateExclusiveLock
> on the relation being vacuumed, and process A could attempt to acquire
> that same lock.
>

As discussed in [1], this risk exists for btree as well, so I am
leaving it as is for now.

> Also in regards to deadlock, I notice that you added a paragraph
> saying that we lock higher-numbered buckets before lower-numbered
> buckets. That's fair enough, but what about the metapage?
>

I have updated the README with regard to the metapage as well.

> The reader
> algorithm suggests that the metapage lock must be taken after the
> bucket locks, because it tries to grab the bucket lock conditionally
> after acquiring the metapage lock, but that's not documented here.
>
> The reader algorithm itself seems to be a bit oddly explained.
>
> pin meta page and take buffer content lock in shared mode
> + compute bucket number for target hash key
> + read and pin the primary bucket page
>
> So far, I'm with you.
>
> + conditionally get the buffer content lock in shared mode on
> primary bucket page for search
> + if we didn't get the lock (need to wait for lock)
>
> "didn't get the lock" and "wait for the lock" are saying the same
> thing, so this is redundant, and the statement that it is "for search"
> on the previous line is redundant with the introductory text
> describing this as the reader algorithm.
>
> + release the buffer content lock on meta page
> + acquire buffer content lock on primary bucket page in shared mode
> + acquire the buffer content lock in shared mode on meta page
>
> OK...
>
> + to check for possibility of split, we need to recompute the bucket and
> + verify, if it is a correct bucket; set the retry flag
>
> This makes it sound like we set the retry flag whether it was the
> correct bucket or not, which isn't sensible.
>
> + else if we get the lock, then we can skip the retry path
>
> This line is totally redundant. If we don't set the retry flag, then
> of course we can skip the part guarded by if (retry).
>
> + if (retry)
> + loop:
> + compute bucket number for target hash key
> + release meta page buffer content lock
> + if (correct bucket page is already locked)
> + break
> + release any existing content lock on bucket page (if a
> concurrent split happened)
> + pin primary bucket page and take shared buffer content lock
> + retake meta page buffer content lock in shared mode
>
> This is the part I *really* don't understand. It makes sense to me
> that we need to loop until we get the correct bucket locked with no
> concurrent splits, but why is this retry loop separate from the
> previous bit of code that set the retry flag? In other words, why is
> it not something like this?
>
> pin the meta page and take shared content lock on it
> compute bucket number for target hash key
> if (we can't get a shared content lock on the target bucket without blocking)
> loop:
> release meta page content lock
> take a shared content lock on the target primary bucket page
> take a shared content lock on the metapage
> if (previously-computed target bucket has not been split)
> break;
>
> Another thing I don't quite understand about this algorithm is that in
> order to conditionally lock the target primary bucket page, we'd first
> need to read and pin it. And that doesn't seem like a good thing to
> do while we're holding a shared content lock on the metapage, because
> of the principle that we don't want to hold content locks across I/O.
>

I have changed it such that we don't perform I/O while holding a
content lock; the downside is that the metapage now has to be locked
twice, which will hurt performance, but we can buy back that
performance by caching the metapage [2]. I have updated the README
accordingly.
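
In code form, the reader's bucket lookup now works roughly as below (a
sketch using the existing hash AM helpers; the attached patch is
authoritative). Note that the metapage content lock is never held
while reading the bucket page:

    Buffer      metabuf,
                buf = InvalidBuffer;
    HashMetaPage metap;
    Bucket      bucket;
    BlockNumber blkno,
                oldblkno = InvalidBlockNumber;
    bool        retry = false;

    metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
    metap = HashPageGetMeta(BufferGetPage(metabuf));

    /* loop until we hold a shared lock on the correct target bucket */
    for (;;)
    {
        bucket = _hash_hashkey2bucket(hashkey, metap->hashm_maxbucket,
                                      metap->hashm_highmask,
                                      metap->hashm_lowmask);
        blkno = BUCKET_TO_BLKNO(metap, bucket);

        /* release the metapage lock, but keep the pin, before any I/O */
        LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);

        if (retry)
        {
            if (oldblkno == blkno)
                break;          /* no concurrent split; we are done */
            _hash_relbuf(rel, buf); /* a split moved us; drop stale lock */
        }

        /* read and lock the bucket page with no lock held on the metapage */
        buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);

        /* retake the metapage lock and re-check for a concurrent split */
        LockBuffer(metabuf, BUFFER_LOCK_SHARE);
        oldblkno = blkno;
        retry = true;
    }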

> -- then, per read request:
> release pin on metapage
> - read current page of bucket and take shared buffer content lock
> - step to next page if necessary (no chaining of locks)
> + if the split is in progress for current bucket and this is a new bucket
> + release the buffer content lock on current bucket page
> + pin and acquire the buffer content lock on old bucket in shared mode
> + release the buffer content lock on old bucket, but not pin
> + retake the buffer content lock on new bucket
> + mark the scan such that it skips the tuples that are marked
> as moved by split
>
> Aren't these steps done just once per scan? If so, I think they
> should appear before "-- then, per read request" which AIUI is
> intended to imply a loop over tuples.
>
> + step to next page if necessary (no chaining of locks)
> + if the scan indicates moved by split, then move to old bucket
> after the scan
> + of current bucket is finished
> get tuple
> release buffer content lock and pin on current page
> -- at scan shutdown:
> - release bucket share-lock
>
> Don't we have a pin to release at scan shutdown in the new system?
>

I have already replied to this point in a previous e-mail.
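
For anyone following along, the skip itself is conceptually a one-line
check per tuple (a sketch; the INDEX_MOVED_BY_SPLIT_MASK bit is the
patch's tuple marking, and scanning_new_bucket stands in for the
scan-opaque state that tracks it):

    /*
     * While scanning the bucket being populated by an in-progress
     * split, skip tuples the split has already copied from the old
     * bucket; the scan returns those from the old bucket instead, so
     * no tuple is seen twice.
     */
    if (scanning_new_bucket &&
        (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK) != 0)
        continue;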

> Well, I was hoping to get through the whole patch in one email, but
> I'm not even all the way through the README. However, it's late, so
> I'm stopping here for now.
>

Thanks for the valuable feedback.

[1] - https://www.postgresql.org/message-id/CA%2BTgmoZWH0L%3DmEq9-7%2Bo-yogbXqDhF35nERcK4HgjCoFKVbCkA%40mail.gmail.com
[2] - https://commitfest.postgresql.org/11/715/

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
extend_bufmgr_api_for_hash_index_v1.patch application/octet-stream 7.1 KB
concurrent_hash_index_v9.patch application/octet-stream 99.3 KB
