
Re: GSoC 2011: Fast GiST index build

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GSoC 2011: Fast GiST index build
Date: 2011-04-26 06:46:45
Message-ID: 4DB66A55.4020805@enterprisedb.com
Lists: pgsql-hackers
Congrats on being selected, looking forward to mentoring you!

On 25.04.2011 23:09, Alexander Korotkov wrote:
> The first question that I would like to discuss is the node buffer storage.
> During the index build, each non-leaf index page should have a buffer of
> several pages. So my question is where to store these buffers and how to
> operate on them? This is somewhat similar to the GIN fastupdate buffer, but
> with differences. First, we have to manage many buffers instead of only
> one. Second, I believe we don't need to worry about concurrency as much,
> because the algorithm assumes relatively large operations in memory
> (relocating entries between several buffers), which requires locking all of
> the buffers currently being operated on. I'm going to store the buffers
> separately from the index itself, because we need to free all of them once
> the index is built.

Just palloc() the buffers in memory, at least in the first phase. That'll 
work fine for index creation. Dealing with concurrent searches and inserts 
makes it a lot more complicated; it's better to make it work for index 
creation first, and investigate something like the GIN fastupdate buffers 
later if you have time left.

> I found a very simple solution for dealing with varlena keys. The largest
> buffer size and the smallest level step are achieved when the key size is
> minimal, so the minimal key size is the worst case. Since the minimal
> varlena size is 4 bytes, we can use it in the initial calculations. I'm
> going to rely on this assumption in the first implementation.

Ok, good.

The first priority should be to have something that works well enough to be 
benchmarked. The paper you referred to in the GSoC application [1] 
contained empirical results on the number of I/O operations needed by the 
algorithm, but it didn't take the operating system cache into account at 
all. That makes the empirical results next to worthless; keeping some 
stuff in in-memory buffers is obviously going to reduce I/O if you don't 
take the OS cache into account.

So we're going to need benchmark results that show a benefit, or there's 
no point in doing this at all. The sooner we get to benchmarking, even 
with a very limited and buggy version of the patch, the better. If the 
algorithm described in that paper doesn't give much benefit, you might 
have to switch to some other algorithm halfway through the project. 
Fortunately, there are plenty of R-tree bulk loading algorithms in the 
literature; it should be possible to adapt some of them to GiST.

[1] http://dx.doi.org/10.1007/s00453-001-0107-6

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
