GSoC proposal: Fast GiST index build

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: GSoC proposal: Fast GiST index build
Date: 2011-04-04 11:16:04
Message-ID: BANLkTimzysE8KbxvFAaOc4RCyumCLKs7Og@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello!

Here is the text of my proposal which I've recently applied to GSoC and have
mentioned before in
http://archives.postgresql.org/message-id/AANLkTimqFRmFdrYaesnJB8H4BuJo3j1SBdR1qmv=kieD@mail.gmail.com
Any comments are welcome.

*Project name*

Fast GiST index build

*Synopsis*

Currently GiST index don't have any bulk load functionality. It have to
create new index by entry insertion one by one. This makes new index
creation relatively slow in comparison with btree and even GIN. There are
various works in computer science about bulk operation on R-tree. Since GiST
in basic is generalization of R-tree, some of them seems to be applicable to
GiST. In [2] R-tree bulk operations techniques are presented. Despite some
issues related with GiST distinction from R-tree, techniques of this paper
seems to be applicable to GiST.

*Benefits to the PostgreSQL Community*

Faster GiST index build would be actual for many PostgreSQL applications.
For example, some GIS systems suffers from index building in may hours and
even days.

*Quantifiable results*

Acceleration of GiST index build.

*Project Details*

Paper [2] present R-tree bulk operations techniques which are, in general,
application of buffer tree [1] to R-tree. The technique proposed in the
paper [2] can be very briefly described as following. Additional buffers is
attached to nodes in some levels (levels are selected with some step). When
entry fall into node with buffer, it is places into buffer. Then buffer is
overflowed it runs down into lower buffers or leaf pages.

The results of the paper [2] can have following applications for PostgreSQL
GiST:

1) Fast GiST index creation. The proposed technique of bulk insert can be
most straight-forwardly applied to GiST index creation. Since R-tree was
cosidered in the paper, key length is fixed. In GiST we can deal with
varlena keys that can leads complexities. For example, page split can occurs
during placing index entry into appropriate buffers. Another difficulty with
varlena keys is that size of buffer and level step of buffers are calculated
by the number of entries in page. When using varlena keys this number is not
constant. Since, these calculations is used only for guarantee operation
performance, we can estimate number of entries in page (using average key
length from some keys). Herewith performance guarantees wouldn't be so
strong.
2) Bulk insert. Since access method interfae in PostgreSQL doesn't contain
bulk insert function, bulk insert operations are performed by entries
insertion one by one. And access method doesn't know how many entries is
expected. This circumstance makes application of bulk insert technique
difficult for PostgreSQL GiST. With current access method interface bulk
insert can be implemented using technique like fastupdate in GIN. But let us
note that such technique lead concurrent index searches to scan buffers.
Since, buffer size is significant (the size of most lowest buffer is
comparable with size of whole subtree), it can lead to significant slowdown.
The finding of compromise between index build acceleration and concurrent
search slowdown is a separate research, and it doesn't seem to be feasible
to fit it in the GSoC.
3) Bulk delete. Current bulk delete operation in PostgreSQL GiST is fast,
but it doesn't perform index restructuring. These can lead to worsening of
index during bulk delete. The bulk delete technique in the paper [2] focused
on acceleration of bulk delete procedure which perform index restructuring.
The problem of application of this approach to GiST is that GiST
implementation doesn't provide any explicit guarantees about storage
utilization. If such guarantees exists then they are hidden in the picksplit
access method. Application of bulk delete technique requires GiST access
methos to know about storage utilization guarantees of implementation. In
turn it require a GiST operator class refactoring, and it doesn't seem to be
feasible to fit it in the GSoC.
Accordingly, in this GSoC project fast GiST index creation would be
implemented. The results of this implementation can help to find a way to
implement additional advantages noted above.

*Inch-stones*

1) Solve architecture questions with help of commutiny. I'm going to produce
approach to varlena keys handling on this step and discuss issues of buffer
handling.
2) First, approximate implementation. On this step implementation might not
always properly work with varlena keys, on some corner cases significant
slowdown is possible.
3) Accurate implementation of varlena keys handling and detail testing on
corner cases. This step supposes detail consideration on particular test
cases of corner cases which might cause dip in performance and their
solution implementation.
4) Final refactoring, documentation, testing and commit.

*Project Schedule*

until May 31

Solve architecture questions with help of commutiny.

1 june - 30 june

First, approximate implementation.

1 july - 31 july

Implementation of varlena keys handling and detail testing on corner cases.

August 1 - August 16:

Final refactoring, testing and commit.

*Completeness Criteria*

Fast GiST index creation is implemented, working and commited.

*Links*

1) L. Arge. The Bu er Tree: A new technique for optimal I O-algorithms. In
S. G. Akl, F. Dehne, J.-R. Sack, and N. Santoro, editors, Algorithms and
Data Structures, 4th International Work-shop, WADS '95, volume 955 of
Lecture Notes in Computer Science, pages 334 345. Springer, 1993.
2) Lars Arge, Klaus Hinrichs, Jan Vahrenhold, Jeffrey Scott Vitter.
Efficient Bulk Operations on Dynamic R-trees. ALENEX '99 Selected papers
from the International Workshop on Algorithm Engineering and Experimentation

----
With best regards,
Alexander Korotkov.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2011-04-04 11:35:13 Re: Proposal: q-gram GIN and GiST indexes
Previous Message Andrew Dunstan 2011-04-04 11:09:31 Re: cast from integer to money