Skip site navigation (1) Skip section navigation (2)

Re: [GENERAL] Creation of tsearch2 index is very

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Ron <rjpeace(at)earthlink(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [GENERAL] Creation of tsearch2 index is very
Date: 2006-01-21 16:33:27
Message-ID: Pine.GSO.4.63.0601211927550.14417@ra.sai.msu.su (view raw or flat)
Thread:
Lists: pgsql-generalpgsql-performance
On Sat, 21 Jan 2006, Ron wrote:

> Perhaps a different approach to this problem is called for:
>
> _Managing Gigabytes: Compressing and Indexing Documents and Images_  2ed
> Witten, Moffat, Bell
> ISBN 1-55860-570-3
>
> This is a VERY good book on the subject.
>
> I'd also suggest looking at the publicly available work on indexing and 
> searching for search engines like Inktomi (sp?) and Google.
> Ron

Ron,

you completely miss the problem ! We do know MG and other SE.  Actually,
we've implemented several search engines based on inverted index technology 
(see, for example, pgsql.ru/db/pgsearch). tsearch2 was designed for
online indexing, while keeping inverted index online is rather difficult
problem. We do have plan to implement inverted index as an option for
large read-only archives, but now we discuss how to organize online
index and if possible to optimize current storage for signatures 
without breaking search performance.

>
>
> At 08:34 AM 1/21/2006, Oleg Bartunov wrote:
>> On Sat, 21 Jan 2006, Ron wrote:
>> 
>>> At 07:23 PM 1/20/2006, Tom Lane wrote:
>>>> "Steinar H. Gunderson" <sgunderson(at)bigfoot(dot)com> writes:
>>>> > On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
>>>> >> It's also worth considering that the entire approach is a heuristic,
>>>> >> really --- getting the furthest-apart pair of seeds doesn't guarantee
>>>> >> an optimal split as far as I can see.  Maybe there's some totally
>>>> >> different way to do it.
>>>> > For those of us who don't know what tsearch2/gist is trying to 
>>>> accomplish
>>>> > here, could you provide some pointers? :-)
>>>> Well, we're trying to split an index page that's gotten full into two 
>>>> index pages, preferably with approximately equal numbers of items in
>>>> each new page (this isn't a hard requirement though).
>>> 
>>> Maybe we are over thinking this.  What happens if we do the obvious and 
>>> just make a new page and move the "last" n/2 items on the full page to the 
>>> new page?
>>> 
>>> Various forms of "move the last n/2 items" can be tested here:
>>> 0= just split the table in half.  Sometimes KISS  works. O(1).
>>> 1= the one's with the highest (or lowest) "x" value.
>>> 2= the one's with the highest sum of coordinates (x+y+...= values in the 
>>> top/bottom n/2 of entries).
>>> 3= split the table so that each table has entries whose size_waste values 
>>> add up to approximately the same value.
>>> 4= I'm sure there are others.
>>> 1-5 can be done in O(n) time w/o auxiliary data.  They can be done in O(1) 
>>> if we've kept track of the appropriate metric as we've built the current 
>>> page.
>>> 
>>> 
>>>> I think the true figure of merit for a split is how often will subsequent 
>>>> searches have to descend into *both* of the resulting pages as opposed to 
>>>> just one
>>>> --- the less often that is true, the better.  I'm not very clear on what 
>>>> tsearch2 is doing with these bitmaps, but it looks like an upper page's 
>>>> downlink has the union (bitwise OR) of the one-bits in the values on the 
>>>> lower page, and you have to visit the lower page if this union has a 
>>>> nonempty intersection with the set you are looking for.  If that's 
>>>> correct, what you really want is to divide the values so that the unions 
>>>> of the two sets have minimal overlap ... which seems to me to have little 
>>>> to do with what the code does at present.
>>> I'm not sure what "upper page" and "lower page" mean here?
>>> 
>>> 
>>>> Teodor, Oleg, can you clarify what's needed here?
>>> Ditto.  Guys what is the real motivation and purpose for this code?
>> 
>> we want not just split the page by two very distinct parts, but to keep
>> resulted signatures which is ORed signature of all signatures in the page
>> as much 'sparse' as can. some information available here
>> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
>> 
>> Unfortunately, we're rather busy right now and couldn't be very useful.
>
>

 	Regards,
 		Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

In response to

pgsql-performance by date

Next:From: Tom LaneDate: 2006-01-21 18:27:53
Subject: Re: [GENERAL] Creation of tsearch2 index is very
Previous:From: RonDate: 2006-01-21 16:11:53
Subject: Re: [GENERAL] Creation of tsearch2 index is very

pgsql-general by date

Next:From: Tom LaneDate: 2006-01-21 18:10:08
Subject: Re: mac os x compile failure
Previous:From: RonDate: 2006-01-21 16:11:53
Subject: Re: [GENERAL] Creation of tsearch2 index is very

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group