Re: Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Indexes
Date: 2016-09-08 04:32:44
Message-ID: CAA4eK1+i4SGP4_XHqdqW-VT9tvUR7B0oy2BjYQVdk1aY1mD3mQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 7, 2016 at 11:49 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Thu, Sep 1, 2016 at 8:55 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>>
>> I have fixed all other issues you have raised. Updated patch is
>> attached with this mail.
>
>
> I am finding the comments (particularly README) quite hard to follow. There
> are many references to an "overflow bucket", or similar phrases. I think
> these should be "overflow pages". A bucket is a conceptual thing consisting
> of a primary page for that bucket and zero or more overflow pages for the
> same bucket. There are no overflow buckets, unless you are referring to the
> new bucket to which things are being moved.
>

Hmm. I think page or block is a concept of database systems and
buckets is a general concept used in hashing technology. I think the
difference is that there are primary buckets and overflow buckets. I
have checked how they are referred in one of the wiki pages [1],
search for overflow on that wiki page. Now, I think we shouldn't be
inconsistent in using them. I will change to make it same if I find
any inconsistency based on what you or other people think is the
better way to refer overflow space.

> Was maintaining on-disk compatibility a major concern for this patch? Would
> you do things differently if that were not a concern?
>

I would not have done much differently from what it is now, however
one thing I have considered during development was to change the hash
index tuple structure as below to mark the index tuples as
move-by-split:

typedef struct
{
IndexTuple entry; /* tuple to insert */
bool moved_by_split;
} HashEntryData;

The other alternative was to use the (unused) bit in IndexTupleData->tinfo.

I have chosen the later approach, now one could definitely argue that
it is the last available bit in IndexTuple and using it for hash
indexes might or might not be best thing to do. However, I think it
is also not advisable to break the compatibility if we can use some
existing bit. In any case, the same question can arise whenever
anyone wants to use it for some other purpose.

> In particular, I am thinking about the need for every insert to
> exclusive-content-lock the meta page to increment the index-wide tuple
> count.

This is not what this patch has changed. The main purpose of this
patch is to change heavy-weight locking to light-weight locking and
provide a way to handle the in-complete splits, both of which are
required to sensibly write WAL for hash-indexes. Having said that, I
agree with your point that we can improve the insertion logic, so that
we don't need to Write-lock the meta-page at each insert. I have
noticed some other improvements in hash indexes as well during this
work like caching the meta page, reduce lock/unlock calls for
retrieving tuples from a page by making hash index scans work a page
at a time as we do for btree scans, kill_prior_tuple mechanism is
current quite naive and needs improvement and the biggest improvement
is needed in create index logic where we are inserting tuple-by-tuple
whereas btree operates at page level and also by-passes the shared
buffers. One of such improvements (cache the meta page) is already
being worked upon by my colleague and the patch [2] for same is in CF.
The main point I want to high light is that apart from what this patch
does, there are number of other potential areas which needs
improvements in hash indexes and I think it is better to do those as
separate enhancements rather than as a single patch.

> I think that this is going to be a huge bottleneck on update
> intensive workloads (which I don't believe have been performance tested as
> of yet).

I have done some performance testing with this patch and I find there
was a significant improvement as compare to what we have now in hash
indexes even for read-write workload. I think the better idea is to
compare it with btree, but in any case, even if this proves to be a
bottleneck, we should try to improve it in a separate patch rather
than as a part of this patch.

> I was wondering if we might not want to change that so that each
> bucket keeps a local count, and sweeps that up to the meta page only when it
> exceeds a threshold. But this would require the bucket page to have an area
> to hold such a count. Another idea would to keep not a count of tuples, but
> of buckets with at least one overflow page, and split when there are too
> many of those.

I think both of these ideas could lead to change the point (tuple
count) where we currently split. This might impact the search speed
and space usage. Yet another alternative could be to change
hashm_ntuples to 64bit and use 64-bit atomics to operate on it or may
be use a separate spin-lock to protect it. However, whatever we
decide to do with it, I think it is a matter of separate patch.

Thanks for looking into patch.

[1] - https://en.wikipedia.org/wiki/Linear_hashing
[2] - https://commitfest.postgresql.org/10/715/

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2016-09-08 04:37:44 Re: Write Ahead Logging for Hash Indexes
Previous Message Mark Kirkwood 2016-09-08 04:32:02 Re: Write Ahead Logging for Hash Indexes