
Re: GiST index performance

From: Matthew Wakeling <matthew(at)flymine(dot)org>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-performance(at)postgresql(dot)org, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: GiST index performance
Date: 2009-04-22 13:53:05
Message-ID: alpine.DEB.2.00.0904221428230.22330@aragorn.flymine.org
Lists: pgsql-performance
On Tue, 21 Apr 2009, Matthew Wakeling wrote:
> Unfortunately, it seems there is another bug in the picksplit function. My 
> patch fixes a bug that reveals this new bug. The whole picksplit algorithm is 
> fundamentally broken, and needs to be rewritten completely, which is what I 
> am doing.

I have now rewritten the picksplit and penalty functions for the bioseg 
data type, and they perform much better. The index size is now 164MB, 
compared to 350MB or so originally, and 2400MB after my earlier bugfix. 
Execution time of one of our queries (basically a nested loop join over 
a sequential scan and an index lookup in this index type) has gone down 
from 45 minutes to two minutes.

I have abandoned "Guttman's poly time split algorithm". A fundamental flaw 
in that picksplit algorithm is that it creates two separate target sets and 
incrementally adds each entry to whichever set would grow the least in 
range size. However, if the entries arrive in any sort of order, they all 
get added to the same set, growing it by a small amount each time. This 
caused the picksplit algorithm to split a set of 367 entries into a set of 
366 and a set of one a high proportion of the time.
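
To illustrate the failure mode, here is a rough standalone sketch (not the 
bioseg code itself; the Range struct, the seeding of the two sets, and the 
values are made up for illustration) of how greedy "add to whichever set 
grows least" collapses on ordered input:

#include <stdio.h>

typedef struct { int lo, hi; } Range;

/* How much the bounding range of a set would expand to absorb an entry. */
static int growth(Range set, Range e)
{
    int lo = e.lo < set.lo ? e.lo : set.lo;
    int hi = e.hi > set.hi ? e.hi : set.hi;
    return (hi - lo) - (set.hi - set.lo);
}

int main(void)
{
    Range entries[8];
    Range left, right;
    int i, n_left = 1, n_right = 1;

    /* Entries arriving in sorted order: [0,10], [10,20], [20,30], ... */
    for (i = 0; i < 8; i++) {
        entries[i].lo = i * 10;
        entries[i].hi = i * 10 + 10;
    }

    /* Seed one set with the first entry and the other with the last. */
    left = entries[0];
    right = entries[7];

    /* Greedily assign each remaining entry to the set that grows least. */
    for (i = 1; i < 7; i++) {
        if (growth(left, entries[i]) <= growth(right, entries[i])) {
            if (entries[i].lo < left.lo) left.lo = entries[i].lo;
            if (entries[i].hi > left.hi) left.hi = entries[i].hi;
            n_left++;
        } else {
            if (entries[i].lo < right.lo) right.lo = entries[i].lo;
            if (entries[i].hi > right.hi) right.hi = entries[i].hi;
            n_right++;
        }
    }

    /* Prints "left: 7  right: 1" - one set absorbs nearly everything. */
    printf("left: %d  right: %d\n", n_left, n_right);
    return 0;
}

Scaled up to a real page, that is exactly the 366-and-1 split above.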

I have replaced the picksplit algorithm with a simple one. For each range 
element, find the midpoint of the range. Then find the mean of all the 
midpoints. All elements with a midpoint below the mean go in one set, and 
the others go in the second set. This usually splits the entries in a 
meaningful way.
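
In case it helps to see it spelled out, here is a simplified sketch of that 
split (again not the real picksplit function - no GiST entry vectors, just 
a bare Range struct and illustrative names):

#include <stdio.h>

typedef struct { double lo, hi; } Range;

/*
 * Partition entries by comparing each range's midpoint against the mean
 * of all midpoints. Indices of the two resulting sets are written to
 * left_idx and right_idx.
 */
static void midpoint_split(const Range *entries, int n,
                           int *left_idx, int *n_left,
                           int *right_idx, int *n_right)
{
    double sum = 0.0, mean;
    int i;

    for (i = 0; i < n; i++)
        sum += (entries[i].lo + entries[i].hi) / 2.0;
    mean = sum / n;

    *n_left = *n_right = 0;
    for (i = 0; i < n; i++) {
        double mid = (entries[i].lo + entries[i].hi) / 2.0;
        if (mid < mean)
            left_idx[(*n_left)++] = i;
        else
            right_idx[(*n_right)++] = i;
    }
}

int main(void)
{
    Range entries[4] = { {0, 2}, {1, 3}, {10, 12}, {11, 13} };
    int left[4], right[4], n_left, n_right;

    midpoint_split(entries, 4, left, &n_left, right, &n_right);
    printf("left set: %d entries, right set: %d entries\n", n_left, n_right);
    return 0;
}

(A real implementation still needs a fallback for the degenerate case where 
every midpoint equals the mean.)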

I have also changed the penalty function. Previously, the penalty was the 
amount by which the range would have to expand, so if a new element fitted 
inside the existing range, the penalty was zero. I have changed it to break 
ties between multiple index pages that the element would fit in without 
expanding the range: the element should be inserted into the index page 
with the smallest range. This prevents a large element from messing up the 
index by forcing an index page range so large that it sucks all the 
elements in the surrounding area into one non-selective group.
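
Roughly, the new penalty amounts to something like the following sketch (a 
bare Range struct and illustrative names, not the actual GiST penalty 
signature; the exact tie-break scaling here is just one plausible choice):

#include <stdio.h>
#include <float.h>

typedef struct { double lo, hi; } Range;

/* Lower penalty = better insertion target, as in GiST. */
static double range_penalty(Range page, Range newentry)
{
    double lo = newentry.lo < page.lo ? newentry.lo : page.lo;
    double hi = newentry.hi > page.hi ? newentry.hi : page.hi;
    double expansion = (hi - lo) - (page.hi - page.lo);

    if (expansion > 0.0)
        return expansion;   /* the page range would have to grow */

    /*
     * The entry fits inside the existing page range. Rather than a flat
     * zero, return a tiny penalty proportional to the page's width, so
     * that among all pages needing no expansion the narrowest one wins.
     */
    return (page.hi - page.lo) * DBL_EPSILON;
}

int main(void)
{
    Range narrow = {10, 20}, wide = {0, 100}, e = {12, 15};

    /* The narrow page gets the smaller penalty, so it is preferred. */
    printf("narrow: %g  wide: %g\n",
           range_penalty(narrow, e), range_penalty(wide, e));
    return 0;
}

Scaling by DBL_EPSILON keeps the tie-break well below any real expansion in 
practice, though that particular choice is an assumption rather than what 
the patch actually does.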

I may experiment with improving these functions further. The main problem 
with this index is the fact that I need to index ranges with a wide 
variety of widths, and I have a couple more strategies yet to help with 
that.

I will post a patch when I have ported my bioseg code over to the seg data 
type.

Matthew

-- 
 Riker: Our memory pathways have become accustomed to your sensory input.
 Data:  I understand - I'm fond of you too, Commander. And you too Counsellor
