Re: WIP: Fast GiST index build

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Fast GiST index build
Date: 2011-08-14 19:30:43
Message-ID: CAPpHfdu4DusbHtMLE+csQJj61rMg_JtFOGSwrQVoOF5wPBkixQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

Thank you for your notes.

On Fri, Aug 12, 2011 at 7:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Thu, Aug 11, 2011 at 6:21 AM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
> > [ new patch ]
>
> Some random comments:
>
> - It appears that the "noFollowFight" flag is really supposed to be
> called "noFollowRight".
>
Fixed.

- In gist_private.h you've written "halt-filled" where you really mean
> "half-filled".
>
Fixed.

> - It seems like you're using reloptions to set parameters that are
> only going to do anything at index creation time. IIUC, "BUFFERING",
> "LEVELSTEP" and "BUFFERSIZE" have no permanent meaning for that index;
> they're just used ephemerally while constructing it. If we're going
> to expose such things as options, maybe they should be GUCs, not
> reloptions.
>
Having these as index parameters may be helpful when you reindex. It's
likely that you would like to rebuild it with same parameters as it was
created. Actually, we have the same situation with FILLFACTOR: it is used
only during index creation.

> - Function names should begin with "gist" or some other, appropriate
> prefix, especially if they are non-static. decreasePathRefcount(),
> getNodeBuffer(), relocateBuildBuffersOnSplit(), adn
> getNodeBufferBusySize() violate this rule, and it might be good to
> change the static functions to follow it, too, just for consistency,
> and to avoid renaming things if something that's currently static
> later needs to be made non-static.
>
Fixed.

> - validateBufferOption needs to use ereport(), not elog().
>
Fixed.

> - This needs a bit of attention:
>
> + /* TODO: Write the WAL record */
> + if (RelationNeedsWAL(state->r))
> + recptr = gistXLogSplit(state->r->rd_node,
> blkno, is_leaf,
> + dist,
> oldrlink, oldnsn, InvalidBuffer, true);
> + else
> + recptr = GetXLogRecPtrForTemp();
> +
>
> I don't think the comment matches the code, since gistXLogSplit() does
> in fact call XLogInsert(). Also, you should probably move the

RelationNeedsWAL() test inside gistXLogSplit(). Otherwise, every
> single caller of gistXLogSplit() is going to need to do the same
> dance.
>

> - In gistBufferingPlaceToPage, you've got a series of loops that look like
> this:
>
> + for (ptr = dist; ptr; ptr = ptr->next)
>
> The first loop allocates a bunch of buffers. The second loop sets up
> downlink pointers. Then there's some other code. Then there's a
> third loop, which adds items to each page in turn and sets up right
> links. Then there's a fourth loop, which marks all those buffers
> dirty. Then you write XLOG. Then there's a fifth loop, which sets
> all the LSNs and TLIs, and a sixth loop, which does
> UnlockReleaseBuffer() on each valid buffer in the list. All of this
> seems like it could be simplified. In particular, the third and
> fourth loops can certainly be merged - you should set the dirty bit at
> the same time you're adding items to the page. And the fifth and
> sixth loops can also be merged. You certainly don't need to set all
> the LSNs and TLIs before releasing any of the buffer locks & pins.
> I'm not sure if there's any more merging that can be done than that,
> but you might want to have a look.
>
> I'm also wondering how long this linked list can be. It's not good to
> have large numbers of buffers locked for a long period of time. At
> the very least, some comments are in order here.
>
> Another general comment about this function is that it seems like it
> is backwards. The overall flow of the function is:
>
> if (is_split)
> {
> /* complicated stuff */
> }
> else
> {
> /* simple case */
> }
>
> It seems like it might be better to flip that around and do this:
>
> if (!is_split)
> {
> /* simple case */
> return result;
> }
> /* complicated stuff */
>
> It's easier to read and avoids having the indentation get too deep.
>
> - As I look at this more, I see that a lot of the logic in
> gistBufferingBuildPlaceToPage is copied from gistplacetopage(). It
> would be nice to move the common bits to common subroutines that both
> functions can call, instead of duplicating the code.
>
> - On a related note, gistBufferingBuildPlaceToPage needs to do
> START_CRIT_SECTION and END_CRIT_SECTION at appropriate points in the
> sequence, as gistplacetopage() does.
>
While, I've merged gistplacetopage() and gistBufferingBuildPlaceToPage().
Now I'm trying some more refactoring.

> - gistFindCorrectParent() seems to rely heavily on the assumption that
> there's no concurrent activity going on in this index. Otherwise,
> it's got to be unsafe to release the buffer lock before using the
> answer the function computes. Some kind of comment seems like it
> would be a good idea.
>
Corresponding comment was added.

> - On a more algorithmic note, I don't really understand why we attach
> buffers to all pages on a level or none of them. If it isn't
> necessary to have buffers on every internal page in the tree, why do
> we have them on every other level or every third level rather than,
> say, creating them on the fly in whatever parts of the tree end up
> busy?
>
Idea of having buffers on levels with some step is following. We have enough
of cache to have a sub-tree of some height fits to cache. When we loaded
such sub-tree once we can process index tuples inside it effectively
(without actual IO). During buffer emptying we're flushing index tuples to
undeflying buffers or leaf pages. Having buffers on levels with step we
guarantee that flushing don't require loading(and writing) more then such
sub-tree (which fits to cache). Thus, if we've processed many enough of
index tuples during emptying, it's IO effective. It's possible that some
more effective distribution of buffers exists, but it's currently unclear
for me.

Other changes:
1) Levelstep and buffersize user options were removed.
2) Buffer size is now run time tuned.
3) Buffer emptying now stops when some child can't take index tuple anymore.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
gist_fast_build-0.14.0.patch.gz application/x-gzip 23.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2011-08-14 20:31:53 Re: VACUUM FULL versus unsafe order-of-operations in DDL commands
Previous Message Robert Haas 2011-08-14 18:33:34 Re: our buffer replacement strategy is kind of lame