From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Indexes
Date: 2016-09-13 16:31:22
Message-ID: CAMkU=1y==EbuEhxbL_A6CoP07unaL8sQNjSL9zZw_Zb-tH5FfQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 7, 2016 at 9:32 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Wed, Sep 7, 2016 at 11:49 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> > On Thu, Sep 1, 2016 at 8:55 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > wrote:
> >>
> >>
> >> I have fixed all other issues you have raised. Updated patch is
> >> attached with this mail.
> >
> >
> > I am finding the comments (particularly README) quite hard to follow.
> > There are many references to an "overflow bucket", or similar phrases. I
> > think these should be "overflow pages". A bucket is a conceptual thing
> > consisting of a primary page for that bucket and zero or more overflow
> > pages for the same bucket. There are no overflow buckets, unless you are
> > referring to the new bucket to which things are being moved.
> >
>
> Hmm. I think page or block is a concept of database systems and
> buckets is a general concept used in hashing technology. I think the
> difference is that there are primary buckets and overflow buckets. I
> have checked how they are referred in one of the wiki pages [1],
> search for overflow on that wiki page.

That page seems to use "slot" to refer to the primary bucket/page and all
the overflow buckets/pages which cover the same post-masked values. I
don't think that would be an improvement for us, because "slot" is already
pretty well-used for other things. Their use of "bucket" does seem to be
mostly the same as "page" (or maybe "buffer" or "block"?) but I don't think
we gain anything from creating yet another synonym for page/buffer/block.
I think the easiest thing would be to keep using the meanings which the
existing committed code uses, so that we at least have internal consistency.
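
To be concrete about the meaning I have in mind, here is a rough, untested
sketch of what I call a "bucket": the primary page plus its chain of overflow
pages (field and macro names are from the existing code; locking omitted):

static void
walk_bucket(Relation rel, HashMetaPage metap, Bucket bucket)
{
    BlockNumber blkno = BUCKET_TO_BLKNO(metap, bucket);  /* primary page */

    while (BlockNumberIsValid(blkno))
    {
        Buffer          buf = ReadBuffer(rel, blkno);
        Page            page = BufferGetPage(buf);
        HashPageOpaque  opaque = (HashPageOpaque) PageGetSpecialPointer(page);

        /* every page in the chain belongs to the same bucket */
        Assert(opaque->hasho_bucket == bucket);

        blkno = opaque->hasho_nextblkno;   /* next overflow page, if any */
        ReleaseBuffer(buf);
    }
}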

> Now, I think we shouldn't be
> inconsistent in using them. I will change to make it same if I find
> any inconsistency based on what you or other people think is the
> better way to refer overflow space.
>

I think just "overflow page" or "buffer containing the overflow page".

Here are some more notes I've taken, mostly about the README and comments.

It took me a while to understand that once a tuple is marked as moved by
split, it stays that way forever. It doesn't mean "recently moved by
split", but "ever moved by split". Which works, but is rather subtle.
Perhaps this deserves a parenthetical comment in the README the first time
the flag is mentioned.

========

#define INDEX_SIZE_MASK 0x1FFF
/* bit 0x2000 is not used at present */

This is no longer true, maybe:
/* bit 0x2000 is reserved for index-AM specific usage */
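
For what it's worth, I'd expect the reserved bit to end up being claimed
along these lines (the mask names here are illustrative, not necessarily
what the patch uses):

#define INDEX_AM_RESERVED_BIT 0x2000                     /* in itup.h */

#define INDEX_MOVED_BY_SPLIT_MASK INDEX_AM_RESERVED_BIT  /* in hash.h */

/* set once when a split relocates a tuple, and never cleared afterwards:
 * "ever moved by split", not "recently moved by split" */
itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;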

========

Note that this is designed to allow concurrent splits and scans. If a
split occurs, tuples relocated into the new bucket will be visited twice
by the scan, but that does no harm. As we are releasing the locks during
scan of a bucket, it will allow concurrent scan to start on a bucket and
ensures that scan will always be behind cleanup.

Above, the abrupt transition from splits (first sentence) to cleanup is
confusing. If the cleanup referred to is vacuuming, it should be a new
paragraph or at least have a transition sentence. Or is it referring to
clean-up locks used for control purposes, rather than for actual vacuum
clean-up? I think it is the first one, the vacuum. (I find the committed
version of this comment confusing as well--how in the committed code would
a tuple be visited twice, and why does that not do harm in the committed
code? So maybe the issue here is me, not the comment.)

=======

+Vacuum acquires cleanup lock on bucket to remove the dead tuples and or tuples
+that are moved due to split. The need for cleanup lock to remove dead tuples
+is to ensure that scans' returns correct results. Scan that returns multiple
+tuples from the same bucket page always restart the scan from the previous
+offset number from which it has returned last tuple.

Perhaps it would be better to teach scans to restart anywhere on the page,
rather than forcing more cleanup locks to be taken?
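
For instance, the scan could remember the heap TID of the last tuple it
returned, and re-find its position on the page after re-taking the lock,
instead of trusting the remembered offset number. Untested sketch, where
so->hashso_last_tid is a hypothetical field holding that TID:

OffsetNumber off;
OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
{
    IndexTuple itup = (IndexTuple)
        PageGetItem(page, PageGetItemId(page, off));

    if (ItemPointerEquals(&itup->t_tid, &so->hashso_last_tid))
        break;          /* resume the scan just after this item */
}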

=======
This comment no longer seems accurate (as long as it is just an ERROR and
not a PANIC):

 * XXX we have a problem here if we fail to get space for a
 * new overflow page: we'll error out leaving the bucket split
 * only partially complete, meaning the index is corrupt,
 * since searches may fail to find entries they should find.

The split will still be marked as being in progress, so any scanner will
have to scan the old page and see the tuple there.

========
in _hash_splitbucket comments, this needs updating:

* The caller must hold exclusive locks on both buckets to ensure that
* no one else is trying to access them (see README).

The true prereq here is a buffer clean up lock (pin plus exclusive buffer
content lock), correct?

And then:

* Split needs to hold pin on primary bucket pages of both old and new
* buckets till end of operation.

'retain' is probably better than 'hold', to emphasize that we are dropping
the buffer content lock part of the clean-up lock, but that the pin part of
it is kept continuously (this also matches the variable name used in the
code). Also, the paragraph after that one seems to be obsolete and
contradicts the newly added comments.
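
In buffer-manager terms, I believe the sequence being described is (sketch):

Buffer buf = ReadBuffer(rel, blkno);

LockBufferForCleanup(buf);            /* clean-up lock: pin plus exclusive
                                       * content lock, with no other pins */

/* ... set up the split ... */

LockBuffer(buf, BUFFER_LOCK_UNLOCK);  /* drop the content lock only ... */

/* ... the pin alone is retained for the rest of the operation ... */

ReleaseBuffer(buf);                   /* end of operation: drop the pin */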

===========

/*
 * Acquiring cleanup lock to clear the split-in-progress flag ensures that
 * there is no pending scan that has seen the flag after it is cleared.
 */

But, we are not acquiring a clean up lock. We already have a pin, and we
do acquire a write buffer-content lock, but don't observe that our pin is
the only one. I don't see why it is necessary to have a clean up lock
(what harm is done if an under-way scan thinks it is scanning a bucket that
is being split when it actually just finished the split?), but if it is
necessary then I think this code is wrong. If not necessary, the comment
is wrong.
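
In other words, the comment describes the first of these, while the code
does the second (bucket_buf is just an illustrative name):

/* a clean-up lock: wait until ours is the only pin */
LockBufferForCleanup(bucket_buf);

/* what the code actually does: we already hold a pin, and simply take
 * the exclusive content lock; other backends may still hold pins */
LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);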

Also, why must we hold a write lock on both old and new primary bucket
pages simultaneously? Is this in anticipation of the WAL patch? The
contract for the function does say that it returns both pages write locked,
but I don't see a reason for that part of the contract at the moment.

=========

To avoid deadlock between readers and inserters, whenever there is a need
to lock multiple buckets, we always take in the order suggested in Locking
Definitions above. This algorithm allows them a very high degree of
concurrency.

The section referred to is actually spelled "Lock Definitions", no "ing".

The Lock Definitions section doesn't mention the meta page at all. I
think there needs to be something added to it about how the meta page gets
locked and why that is deadlock free. (But we could be optimistic and
assume the patch to implement caching of the metapage will go in and will
take care of that.)
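
The usual way to make multi-page locking deadlock free is a fixed global
acquisition order, e.g. something like this (illustrative sketch, not
necessarily the order the patch actually uses):

/* always lock the lower-numbered block first, so that two backends
 * needing the same pair of pages cannot wait on each other */
if (BufferGetBlockNumber(obuf) < BufferGetBlockNumber(nbuf))
{
    LockBuffer(obuf, BUFFER_LOCK_EXCLUSIVE);
    LockBuffer(nbuf, BUFFER_LOCK_EXCLUSIVE);
}
else
{
    LockBuffer(nbuf, BUFFER_LOCK_EXCLUSIVE);
    LockBuffer(obuf, BUFFER_LOCK_EXCLUSIVE);
}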

=========

And an operational question on this: A lot of stuff is done conditionally
here. Under high concurrency, do splits ever actually occur? It seems
like they could easily be permanently starved.
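
By "conditionally" I mean the pattern where the split simply gives up if
the clean-up lock isn't immediately available, e.g.:

if (!ConditionalLockBufferForCleanup(bucket_buf))
{
    /* some other backend holds a pin; skip the split this time */
    return;
}
/* ... proceed with the split ... */

Under a steady stream of scans, that test might simply never succeed.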

Cheers,

Jeff
