Re: GSoC on WAL-logging hash indexes

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tan Tran <tankimtran(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GSoC on WAL-logging hash indexes
Date: 2014-03-06 13:11:04
Message-ID: 531873E8.6050802@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-advocacy pgsql-hackers pgsql-students

(de-CC'ing pgsql-advocacy)

On 03/06/2014 04:03 AM, Greg Stark wrote:
> On Mon, Mar 3, 2014 at 4:12 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Unfortunately, I don't believe that it's possible to do this easily
>> today because of the way bucket splits are handled. I wrote about
>> this previously here, with an idea for solving the problem:
>
> We could just tackle this in the same incomplete, buggy, way that
> btrees tackled it for years until Heikki fixed them and the way gin
> and gist still do I believe. Namely just emit xlog records for each
> page individually and during replay remember when you have an
> "incomplete split" and complain if recovery ends with any still
> incomplete. That would be unfortunate to be adding new cases of this
> just as Heikki and company are making progress eliminating the ones we
> already had but that's surely better than having no recovery at all.

Grmph. Indeed.

Looking at Robert's idea from November:

> I have thought about doing some work on this, although I don't know
> when I'll ever have the time. As far as I can see, the basic problem
> is that we need a better way of splitting buckets. Right now,
> splitting a bucket means locking it, scanning the entire bucket chain
> and moving ~1/2 the tuples to the new bucket, and then unlocking it.
> Since this is a long operation, we have to use heavyweight locks to
> protect the buckets, which is bad for concurrency. Since it involves
> a modification to an arbitrarily large number of pages that has to be
> atomic, it's not possible to WAL-log it sensibly -- and in fact, I
> think this is really the major obstacle to being able to implementing
> WAL-logging for hash indexes.

I don't think it's necessary to improve concurrency just to get
WAL-logging. Better concurrency is a worthy goal of its own, of course,
but it's a separate concern.

For crash safety, the key is to make sure you perform the whole
operation in small atomic steps, and you have a way of knowing where to
continue after crash (this is the same whether you do the cleanup
immediately at the end of recovery, which I want to avoid, or lazily
afterwards). But you can hold locks across those small atomic steps, to
ensure concurrency-safety.

> I think we could work around this problem as follows. Suppose we
> have buckets 1..N and the split will create bucket N+1. We need a
> flag that can be set and cleared on the metapage, call it
> split-in-progress, and a flag that can optionally be set on particular
> index tuples, call it moved-by-split. We will allow scans of all
> buckets and insertions into all buckets while the split is in
> progress, but (as now) we will not allow more than one split to be in
> progress at the same time.

Hmm, unless I'm reading the code wrong, it *does* allow more than one
split to be in progress at the same time today.

> [description of how to use the moved-by-split to allow scans to run concurrently with the split]

I guess that would work, although you didn't actually describe how to
continue a split after a crash. But it's a lot simpler if you don't also
try to make it more concurrent:

---
When you start splitting a bucket, first acquire a heavy-weight lock on
the old and new buckets. Allocate the required number of pages, before
changing anything on-disk, so that you can easily back out if you run
out of disk space. So far, this is how splitting works today.

Then you update the metapage to show that the bucket has been split and
initialize the new bucket's primary page (as one atomic WAL-logged
operation). Also mark the new bucket's primary page with a
split-in-progress flag. That's used later by scans to detect if the
split was interrupted.

Now you scan the old bucket, and move all the tuples belonging to the
new bucket. That needs to be done as a series of small atomic WAL-logged
operations, each involving a small number of old and new pages (one WAL
record for each moved tuple is the simplest, but you can do some
batching for efficiency). After you're done, clear the split-in-progress
flag in the new bucket's primary page (WAL-log that), and release the locks.

In a scan, if the bucket you're about to scan has the split-in-progress
flag set, that indicates that the split was interrupted by a crash. You
won't see the flag as set if a concurrent split is in progress, because
you will block on the lock it's holding on the bucket. If you see the
flag as set, you share-lock and scan both buckets, the old and the new.

If you see the split-in-progress flag in the bucket you're about to
insert to, you don't need to do anything special. Just insert the tuple
to the new bucket as normal. Before splitting the new bucket again,
however, the previous split needs to be finished, or things will get
complicated. To do that, acquire the locks on the old and the new
bucket, and scan the old bucket for any remaining tuples that belong to
the new bucket and move them, and finally clear the split-in-progress flag.
---

This is similar to your description, as you scan both buckets if you see
an interrupted split, but doesn't require the per-tuple moved-by-split
flag, or waiting out scans. Also, I put the split-in-progress flag in
the new bucket's primary page instead of the metapage. That allows
multiple splits to be in progress at the same time.

- Heikki

In response to

Responses

Browse pgsql-advocacy by date

  From Date Subject
Next Message Robert Haas 2014-03-06 19:34:07 Re: GSoC on WAL-logging hash indexes
Previous Message Greg Stark 2014-03-06 02:03:11 Re: [HACKERS] GSoC on WAL-logging hash indexes

Browse pgsql-hackers by date

  From Date Subject
Next Message Oleg Bartunov 2014-03-06 13:16:22 Re: jsonb and nested hstore
Previous Message Fabrízio de Royes Mello 2014-03-06 12:44:22 Next CommitFest Deadlines

Browse pgsql-students by date

  From Date Subject
Next Message Robert Haas 2014-03-06 19:34:07 Re: GSoC on WAL-logging hash indexes
Previous Message Greg Stark 2014-03-06 02:03:11 Re: [HACKERS] GSoC on WAL-logging hash indexes