Re: Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Indexes
Date: 2016-09-15 05:32:35
Message-ID: CAA4eK1+Yxoe4PWR8mKSKTNQE1v3xF=YEAnK==NixHbU-zi31=g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 13, 2016 at 10:01 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Wed, Sep 7, 2016 at 9:32 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>
>
>>
>> Now, I think we shouldn't be
>> inconsistent in using them. I will change to make it same if I find
>> any inconsistency based on what you or other people think is the
>> better way to refer overflow space.
>
>
> I think just "overflow page" or "buffer containing the overflow page".
>

Okay, changed to "overflow page".

>
> Here are some more notes I've taken, mostly about the README and comments.
>
> It took me a while to understand that once a tuple is marked as moved by
> split, it stays that way forever. It doesn't mean "recently moved by
> split", but "ever moved by split". Which works, but is rather subtle.
> Perhaps this deserves a parenthetical comment in the README the first time
> the flag is mentioned.
>

I have added an additional paragraph explaining the moved-by-split flag
along with the explanation of the split operation.

> ========
>
> #define INDEX_SIZE_MASK 0x1FFF
> /* bit 0x2000 is not used at present */
>
> This is no longer true, maybe:
> /* bit 0x2000 is reserved for index-AM specific usage */
>

Changed as per suggestion.
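For reference, a minimal sketch of the t_info bit layout this comment is about. The two constants mirror PostgreSQL's itup.h; the moved-by-split mask name is the one used by the hash AM patch and may differ in the final code:

```python
# Layout of IndexTuple t_info bits, per the comment above.
INDEX_SIZE_MASK = 0x1FFF          # low 13 bits hold the tuple size
INDEX_AM_RESERVED_BIT = 0x2000    # now reserved for index-AM specific use
# Hash AM (patch): the reserved bit marks a tuple moved by a bucket split.
INDEX_MOVED_BY_SPLIT_MASK = INDEX_AM_RESERVED_BIT

def set_moved_by_split(t_info):
    """Mark a tuple as moved by split; the flag is never cleared."""
    return t_info | INDEX_MOVED_BY_SPLIT_MASK

def tuple_size(t_info):
    """Extract the tuple size from the low 13 bits."""
    return t_info & INDEX_SIZE_MASK

t_info = set_moved_by_split(24)        # a 24-byte tuple, flag set
assert tuple_size(t_info) == 24        # size is unaffected by the flag
assert t_info & INDEX_MOVED_BY_SPLIT_MASK
```

This also illustrates the earlier point about the flag's semantics: once set, it stays set for the tuple's lifetime.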

> ========
>
> Note that this is designed to allow concurrent splits and scans. If a
> split occurs, tuples relocated into the new bucket will be visited twice
> by the scan, but that does no harm. As we are releasing the locks during
> scan of a bucket, it will allow concurrent scan to start on a bucket and
> ensures that scan will always be behind cleanup.
>
> Above, the abrupt transition from splits (first sentence) to cleanup is
> confusing. If the cleanup referred to is vacuuming, it should be a new
> paragraph or at least have a transition sentence. Or is it referring to
> clean-up locks used for control purposes, rather than for actual vacuum
> clean-up? I think it is the first one, the vacuum.
>

Yes, it is the first one.

> (I find the committed
> version of this comment confusing as well--how in the committed code would a
> tuple be visited twice, and why does that not do harm in the committed
> coding? So maybe the issue here is me, not the comment.)
>

You have to read this scan as the scan done during vacuum. Whatever is
written in the committed code is right; let me try to explain with an
example. Suppose there are two buckets at the start of vacuum. After
vacuum completes the first bucket, and before or during the vacuum of
the second bucket, a split of the first bucket occurs. Now we have
three buckets. If you look at the code (hashbulkdelete), after
completing the vacuum of the first and second buckets, if there has
been a split it will vacuum the third bucket as well. This is the
reason the README mentions that tuples relocated into the new bucket
will be visited twice.

This whole explanation is in the garbage collection section, so to me
it looks clear. However, I have changed some wording; see if it makes
sense to you now.
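The example above can be put as a toy simulation. This is illustrative only, not the actual hashbulkdelete code; bucket contents and tuple names are made up:

```python
def simulate_bulkdelete():
    """Model vacuum walking buckets while a concurrent split occurs."""
    buckets = {0: ['a', 'b'], 1: ['c', 'd']}
    visited = []

    # vacuum scans bucket 0
    visited += buckets[0]

    # concurrent split of bucket 0: tuple 'b' is relocated into new bucket 2
    buckets[2] = ['b']
    buckets[0] = ['a']

    # vacuum scans bucket 1
    visited += buckets[1]

    # hashbulkdelete notices the bucket count grew during its scan,
    # so it also vacuums the newly created bucket 2
    visited += buckets[2]
    return visited

# The relocated tuple is visited twice, which is harmless for vacuum.
assert simulate_bulkdelete().count('b') == 2
```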

>
> =======
>
> +Vacuum acquires cleanup lock on bucket to remove the dead tuples and or
> tuples
> +that are moved due to split. The need for cleanup lock to remove dead
> tuples
> +is to ensure that scans' returns correct results. Scan that returns
> multiple
> +tuples from the same bucket page always restart the scan from the previous
> +offset number from which it has returned last tuple.
>
> Perhaps it would be better to teach scans to restart anywhere on the page,
> than to force more cleanup locks to be taken?
>

Yeah, we can do that by making hash index scans work a page at a time,
as we do for btree scans. However, as mentioned earlier, this is on
my todo list, and I think it is better to do it as a separate patch
based on this work. Do you think that's reasonable, or do you have some
strong reason why we should consider it as part of this patch only?
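For illustration, here is a minimal sketch of the difference between the two approaches. The data and names are made up; the real code works with page offset numbers under buffer locks, not Python lists. It shows what could go wrong with offset-based restart if vacuum were allowed to compact the page concurrently, which is why the current design takes a cleanup lock:

```python
page = ['t1', 't2', 't3', 't4']

# Offset-based restart: the scan returned 't2' and remembers its offset.
last_offset = 1

# Hypothetically, vacuum removes 't1' and compacts the page in between...
compacted = ['t2', 't3', 't4']

# ...so resuming at last_offset + 1 lands on the wrong tuple: 't3' is
# silently skipped because all offsets shifted down by one.
resumed = compacted[last_offset + 1:]
assert resumed == ['t4']

# Page-at-a-time (btree style): copy all matching tuples while holding
# the page lock, so later compaction cannot affect the scan.
batch = list(page)
assert batch[last_offset + 1:] == ['t3', 't4']
```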

> =======
> This comment no longer seems accurate (as long as it is just an ERROR and
> not a PANIC):
>
> * XXX we have a problem here if we fail to get space for a
> * new overflow page: we'll error out leaving the bucket
> split
> * only partially complete, meaning the index is corrupt,
> * since searches may fail to find entries they should find.
>
> The split will still be marked as being in progress, so any scanner will
> have to scan the old page and see the tuple there.
>

I have removed that part of the comment. I think in the PANIC case
the hash index will be corrupt anyway, so we might not need to mention
anything about it.

> ========
> in _hash_splitbucket comments, this needs updating:
>
> * The caller must hold exclusive locks on both buckets to ensure that
> * no one else is trying to access them (see README).
>
> The true prereq here is a buffer clean up lock (pin plus exclusive buffer
> content lock), correct?
>

Right, and I have changed it accordingly.

> And then:
>
> * Split needs to hold pin on primary bucket pages of both old and new
> * buckets till end of operation.
>
> 'retain' is probably better than 'hold', to emphasize that we are dropping
> the buffer content lock part of the clean-up lock, but that the pin part of
> it is kept continuously (this also matches the variable name used in the
> code).

Okay, changed to retain.

> Also, the paragraph after that one seems to be obsolete and
> contradictory with the newly added comments.
>

Are you talking about:
* In addition, the caller must have created the new bucket's base page,
..

If yes, then I think that is valid. That paragraph mainly highlights
two points. First, the new bucket's base page should be pinned and
write-locked before calling this API, and both will be released by this
API. Second, we must do _hash_getnewbuf() before releasing the
metapage write lock. Both points still seem to be valid.

> ===========
>
> /*
> * Acquiring cleanup lock to clear the split-in-progress flag ensures
> that
> * there is no pending scan that has seen the flag after it is cleared.
> */
>
> But, we are not acquiring a clean up lock. We already have a pin, and we do
> acquire a write buffer-content lock, but don't observe that our pin is the
> only one. I don't see why it is necessary to have a clean up lock (what
> harm is done if a under-way scan thinks it is scanning a bucket that is
> being split when it actually just finished the split?), but if it is
> necessary then I think this code is wrong. If not necessary, the comment is
> wrong.
>

The comment is wrong and I have removed it. This is a remnant of a
previous idea which I wanted to try, but I found problems with it and
didn't pursue it.

> Also, why must we hold a write lock on both old and new primary bucket pages
> simultaneously? Is this in anticipation of the WAL patch?

Yes, clearing the flag on both buckets needs to be an atomic
operation. Apart from that, it is not good to write two different WAL
records (one for clearing the flag on the old bucket and another for
the new bucket).

> The contract for
> the function does say that it returns both pages write locked, but I don't
> see a reason for that part of the contract at the moment.
>

Just refer to its usage in the _hash_finish_split() cleanup flow. The
reason is that we need to retain the lock on one of the buckets
depending on the case.

> =========
>
> To avoid deadlock between readers and inserters, whenever there is a need
> to lock multiple buckets, we always take in the order suggested in
> Locking
> Definitions above. This algorithm allows them a very high degree of
> concurrency.
>
> The section referred to is actually spelled "Lock Definitions", no "ing".
>
> The Lock Definitions sections doesn't mention the meta page at all.

Okay, changed.

> I think
> there needs be something added to it about how the meta page gets locked and
> why that is deadlock free. (But we could be optimistic and assume the patch
> to implement caching of the metapage will go in and will take care of that).
>

I don't think caching the meta page will eliminate the need to lock
the meta page. However, this patch has not changed anything relevant
in meta page locking that could impact deadlock detection. I have
thought about it, but I am not sure what more to write other than what
is already mentioned at different places about the meta page in the
README. Let me know if you have something specific in mind.

> =========
>
> And an operational question on this: A lot of stuff is done conditionally
> here. Under high concurrency, do splits ever actually occur? It seems like
> they could easily be permanently starved.
>

Maybe, but the situation won't be worse than what we have in HEAD.
Even under high concurrency, it can arise only if there is always a
reader on a bucket before we try to split it. The point to note here
is that once the split has started, concurrent readers are allowed,
which was not allowed previously. I think the same argument applies to
other places where readers and writers contend for the same lock, for
example ProcArrayLock. In such cases readers can theoretically starve
writers forever, but in practice such situations are rare.

Apart from fixing the above review comments, I have fixed the issue
reported by Ashutosh Sharma [1].

Many thanks, Jeff, for the detailed review.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BfMUpJoAp5MXKRSv9193JXn25qtG%2BZrYUwb4dUuqmHrA%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
concurrent_hash_index_v7.patch application/octet-stream 103.0 KB
