Re: Hash Indexes

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hash Indexes
Date: 2016-06-22 09:14:05
Message-ID: CAA4eK1+VLQzBXyrYQpL3=GCQW39vCfxmWgmQB=1Y2GB4X1w64Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 21, 2016 at 9:33 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Thu, Jun 16, 2016 at 3:28 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >> Incomplete splits can be completed either by vacuum or insert, as
> >> both need an exclusive lock on the bucket. If vacuum finds the
> >> split-in-progress flag on a bucket then it will complete the split
> >> operation; vacuum won't see this flag if a split is actually in
> >> progress on that bucket, as vacuum needs a cleanup lock and the
> >> split retains its pin till the end of the operation. To make it
> >> work for the insert operation, one simple idea could be that if
> >> insert finds the split-in-progress flag, it releases the current
> >> exclusive lock on the bucket and tries to acquire a cleanup lock on
> >> the bucket. If it gets the cleanup lock, it can complete the split
> >> and then insert the tuple; else it will have an exclusive lock on
> >> the bucket and just perform the insertion of the tuple. The
> >> disadvantage of trying to complete the split in vacuum is that the
> >> split might require new pages, and allocating new pages at the time
> >> of vacuum is not advisable. The disadvantage of doing it at insert
> >> time is that insert might skip it even while some scan on the
> >> bucket is going on, as a scan will also retain a pin on the bucket,
> >> but I think that is not a big deal. The actual completion of the
> >> split can be done in two ways: (a) scan the new bucket and build a
> >> hash table with all of the TIDs you find there; when copying tuples
> >> from the old bucket, first probe the hash table and, if you find a
> >> match, just skip that tuple (idea suggested by Robert Haas
> >> offlist); (b) delete all the tuples that are marked as
> >> moved_by_split in the new bucket and perform the split operation
> >> from the beginning using the old bucket.
> >
> > I have completed the patch with respect to incomplete splits and
> > delayed cleanup of garbage tuples. For incomplete splits, I have
> > used option (a) as mentioned above. The incomplete splits are
> > completed if an insertion sees the split-in-progress flag in a
> > bucket.
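
For the archives, option (a) boils down to roughly the sketch below.
It uses the backend's dynahash routines; bucket_next_tuple() and
copy_tuple_to_new_bucket() are hypothetical stand-ins for the usual
bucket-chain walking code (not functions from the patch), and
old_bucket_buf/new_bucket_buf are assumed to be the pinned bucket
buffers:

/* Hypothetical helpers, standing in for the overflow-page loops. */
extern IndexTuple bucket_next_tuple(Buffer bucket_buf, void **iter);
extern void copy_tuple_to_new_bucket(IndexTuple itup);

HASHCTL     ctl;
HTAB       *tidtab;
IndexTuple  itup;
void       *iter = NULL;
bool        found;

memset(&ctl, 0, sizeof(ctl));
ctl.keysize = sizeof(ItemPointerData);
ctl.entrysize = sizeof(ItemPointerData);
tidtab = hash_create("new-bucket TIDs", 256, &ctl,
                     HASH_ELEM | HASH_BLOBS);

/* Pass 1: remember every heap TID already in the new bucket. */
while ((itup = bucket_next_tuple(new_bucket_buf, &iter)) != NULL)
    (void) hash_search(tidtab, &itup->t_tid, HASH_ENTER, NULL);

/* Pass 2: redo the copy from the old bucket, probing the table so
 * that tuples which were already moved get skipped. */
iter = NULL;
while ((itup = bucket_next_tuple(old_bucket_buf, &iter)) != NULL)
{
    (void) hash_search(tidtab, &itup->t_tid, HASH_FIND, &found);
    if (!found)
        copy_tuple_to_new_bucket(itup);
}

hash_destroy(tidtab);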
>
> It seems to me that there is a potential performance problem here. If
> the split is still being performed, every insert will see the
> split-in-progress flag set. The in-progress split retains only a pin
> on the primary bucket, so other backends could also get an exclusive
> lock, which is all they need for an insert. It seems that under this
> algorithm they will now take the exclusive lock, release the exclusive
> lock, try to take a cleanup lock, fail, and take the exclusive lock again.
> That seems like a lot of extra monkeying around. Wouldn't it be
> better to take the exclusive lock and then afterwards check if the pin
> count is 1? If so, even though we only intended to take an exclusive
> lock, it is actually a cleanup lock. If not, we can simply proceed
> with the insertion. This way you avoid unlocking and relocking the
> buffer repeatedly.
>

We can do it the way you are suggesting, but there is another thing we
need to consider here. As of now, the patch tries to finish the split
if it finds the split-in-progress flag in either the old or the new
bucket. We need to lock both the old and the new bucket to finish the
split, so it is quite possible that two different backends try to lock
them in opposite orders, leading to a deadlock. I think the correct
way to handle this is to always lock the old bucket first and then the
new bucket. To achieve that, if an insertion into the new bucket finds
that the split-in-progress flag is set, it needs to release its lock,
acquire the lock on the old bucket first and ensure its pin count is
1, and then lock the new bucket again and ensure that its pin count is
1. I have already maintained this lock order in scans (old bucket
first and then new bucket; refer to the changes in _hash_first()).
Alternatively, we can try to finish the split only when someone tries
to insert into the old bucket.
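
In pseudo-C, the ordering I have in mind looks roughly as below.
pin_count_is_one() and finish_split() are hypothetical names (not
patch code); the former stands for a check that the exclusive lock we
already hold is in fact a cleanup lock, i.e. our pin is the only one:

/* Insert into the new bucket saw the split-in-progress flag. */
LockBuffer(new_buf, BUFFER_LOCK_UNLOCK);

/* Old bucket first, always, so two backends can't deadlock. */
LockBuffer(old_buf, BUFFER_LOCK_EXCLUSIVE);
if (pin_count_is_one(old_buf))
{
    LockBuffer(new_buf, BUFFER_LOCK_EXCLUSIVE);
    if (pin_count_is_one(new_buf))
        finish_split(old_buf, new_buf); /* returns with locks held */
    LockBuffer(new_buf, BUFFER_LOCK_UNLOCK);
}
LockBuffer(old_buf, BUFFER_LOCK_UNLOCK);

/* Whether or not the split got finished, insert as usual. */
LockBuffer(new_buf, BUFFER_LOCK_EXCLUSIVE);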

> > The second major thing this new version of the patch has achieved
> > is cleanup of garbage tuples, i.e. the tuples that are left in the
> > old bucket during a split. Currently (in HEAD), as part of a split
> > operation, we clean the tuples from the old bucket after moving
> > them to the new bucket, as we hold heavy-weight locks on both old
> > and new buckets till the end of the whole split operation. In the
> > new design, we need to take a cleanup lock on the old bucket and an
> > exclusive lock on the new bucket to perform the split operation,
> > and we don't retain those locks till the end (we release the lock
> > as we move on to overflow buckets). Now, to clean up the tuples we
> > need a cleanup lock on a bucket, which we might not have at split
> > end. So I chose to perform the cleanup of garbage tuples during
> > vacuum and when a re-split of the bucket happens, as during both
> > those operations we hold a cleanup lock. We can extend the cleanup
> > of garbage to other operations as well if required.
>
> I think it's OK for the squeeze phase to be deferred until vacuum or a
> subsequent split, but simply removing dead tuples seems like it should
> be done earlier if possible.

Yes, probably we can do it at the time of insertion into a bucket, if
we don't have the concurrent-scan issue.
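
Roughly, reusing the hypothetical pin_count_is_one() check from the
sketch above (split_in_progress() and delete_moved_by_split_tuples()
are likewise illustrative names, not patch code):

LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);

/* If no concurrent scan holds a pin, our exclusive lock is in
 * effect a cleanup lock, so the dead tuples can go now. */
if (!split_in_progress(bucket_buf) && pin_count_is_one(bucket_buf))
    delete_moved_by_split_tuples(bucket_buf);

/* ... then proceed with the insertion as usual ... */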

--

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
