Re: WIP: Fast GiST index build

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Fast GiST index build
Date: 2011-08-08 09:23:41
Message-ID: 4E3FAB1D.1060105@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 07.08.2011 22:28, Alexander Korotkov wrote:
> There is last version of patch. There is the list of most significant
> changes in comparison with your version of patch:
> 1) Reference counting of path items was added. It should helps against
> possible accumulation of path items.

Ok.

> 2) Neighbor relocation was added.

Ok. I think we're going to need some sort of a heuristic on when to
enable neighbor relocation. If I remember the performance tests
correctly, it improves the quality of the resulting index, but incurs a
significant CPU overhead.

Actually, looking at the performance numbers on the wiki page again
(http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011), it
looks like neighbor relocation doesn't help very much with the index
quality - sometimes it even results in a slightly worse index. Based on
those results, shouldn't we just remove it? Or is there some other data
set where it helps significantly?

> 3) Subtree prefetching was added.

I'm inclined to leave out the prefetching code for now. Unless you have
some performance numbers that show that it's critical for the overall
performance. But I don't think that was the case, it's just an
additional optimization for servers with big RAID arrays. So, please
separate that into an add-on patch. It needs to be performance tests and
reviewed separately.

> 4) Final emptying algorithm was reverted to the original one. My
> experiments shows that typical number of passes in final emptying in
> your version of patch is about 5. It may be significant itself. Also I
> haven't estimate of number of passes and haven't guarantees that it will
> not be high in some corner cases. I.e. I prefer more predictable
> single-pass algorithm in spite of it's a little more complex.

I was trying to get rid of that complexity during index build. Some
extra code in the final pass would be easier to understand than extra
work that needs to be done through the index build. It's not a huge
amount of code, but still.

I'm not worried about the extra CPU overhead of scanning the data
structures at the final pass. I guess in my patch you had to do extra
I/O as well, because the buffers were not emptied in strict top-down
order, so let's avoid that. How about:

Track all buffers in the lists, not only those that are non-empty. Add
the buffer to the right list at getNodeBuffer(). That way in the final
stage, you need to scan through all buffers instead of just the
non-empty ones. But the overhead of that to should be minimal in
practice, scanning some in-memory data structures is pretty cheap
compared to building an index. That way you wouldn't need to maintain
the lists during the index build, except for adding each buffer to
correct lists in getNodeBuffer().

BTW, please use List for the linked lists. No need to re-implement the
wheel.

> 5) Unloading even last page of node buffer from main memory to the disk.
> Imagine that that with levelstep = 1 each inner node has buffer. It
> seems to me that keeping one page of each buffer in memory may be memory
> consuming.
>
> Open items I see at this moment:
> 1) I dislike my switching to buffering build method because it's based
> on very unproven assumptions. And I didn't more reliable assumptions in
> scientific papers while. I would like to replace it with something much
> simplier. For example, switching to buffering build when regular build
> actually starts to produce a lot of IO. For this approach implementation
> I need to somehow detect actual IO (not just buffer read but miss of OS
> cache).

Yeah, that's a surprisingly hard problem. I don't much like the method
used in the patch either.

> 2) I'm worrying about possible size of nodeBuffersTab and path items. If
> we imagine extremely large tree with levelstep = 1 size of this
> datastructures will be singnificant. And it's hard to predict that size
> without knowing of tree size.

I'm not very worried about that in practice. If you have a very large
index, you presumably have a fair amount of memory too. Otherwise the
machine is horrendously underpowered to build or do anything useful with
the index anyway. Nevertheless it would nice to have some figures on
that. If you have, say, an index of 1 TB in size, how much memory will
building the index need?

Miscellaneous observations:

* Please run pgindent over the code, there's a lot of spurious
whitespace in the patch.
* How about renaming GISTLoadedPartItem to something like
GISTBulkInsertStack, to resemble the GISTInsertStack struct used in the
normal insertion code. The "loaded part" nomenclature is obsolete, as
the patch doesn't explicitly load parts of the tree into memory anymore.
Think about the names of other structs, variables and functions too,
GISTLoadedPartItem just caught my eye first but there's probably others
that could have better names.
* Any user-visible options need to be documented in the user manual.
* And of course, make sure comments and the readme are up-to-date.
* Compiler warning:

reloptions.c:259: warning: initializer-string for array of chars is too long
reloptions.c:259: warning: (near initialization for
‘stringRelOpts[0].default_val’)

I don't think there's a way to add an entry to stringRelOpts in a way
that works correctly. That's a design flaw in the reloptions.c code that
has never come up before, as there hasn't been any string-formatted
relopts before (actually buffering option might be better served by an
enum reloption too, if we had that). Please start a new thread on that
on pgsql-hackers.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hannu Krosing 2011-08-08 10:07:07 Re: Transient plans versus the SPI API
Previous Message Tim Bunce 2011-08-08 09:03:55 Re: plperl crash with Debian 6 (64 bit), pl/perlu, libwww and https