Re: multi terabyte fulltext searching

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Benjamin Arai <benjamin(at)araisoft(dot)com>
Cc: Postgresql <pgsql-general(at)postgresql(dot)org>
Subject: Re: multi terabyte fulltext searching
Date: 2007-03-21 16:10:48
Message-ID: Pine.LNX.4.64.0703211908400.12152@sn.sai.msu.ru
Lists: pgsql-general

On Wed, 21 Mar 2007, Benjamin Arai wrote:

> Hi Oleg,
>
> I am currently using GiST indexes because I receive about 10GB of new data a
> week (then again, I am not deleting any information). We do not expect to
> stop receiving text for about 5 years, so the data is not going to become
> static any time soon. The reason I am concerned with performance is that I
> am providing a search system for several newspapers going back essentially
> to the beginning of time. Many bibliographers and others would like to use
> this utility, but if each search takes too long I will not be able to
> support many concurrent users.
>

GiST is ok for your feed, but the archive part should use a GIN index.
Inheritance + constraint exclusion (CE) should make your life easier.
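
A minimal sketch of that layout (the table, column, and index names and the
date ranges below are made up for illustration; it also assumes the tsvector
column is kept populated, e.g. by the usual tsearch2 trigger):

  -- parent table; the children inherit its columns
  CREATE TABLE articles (
      id        serial,
      pub_date  date NOT NULL,
      body      text,
      body_tsv  tsvector
  );

  -- the "feed" child takes the weekly 10GB; GiST copes better with updates
  CREATE TABLE articles_current (
      CHECK (pub_date >= DATE '2007-01-01')
  ) INHERITS (articles);
  CREATE INDEX articles_current_tsv ON articles_current USING gist (body_tsv);

  -- static archive children get GIN, which is much faster to search
  CREATE TABLE articles_2006 (
      CHECK (pub_date >= DATE '2006-01-01' AND pub_date < DATE '2007-01-01')
  ) INHERITS (articles);
  CREATE INDEX articles_2006_tsv ON articles_2006 USING gin (body_tsv);

  -- let the planner skip children whose CHECK constraint rules them out
  SET constraint_exclusion = on;

  -- a query against the parent scans only the matching children,
  -- provided it constrains pub_date
  SELECT id FROM articles
  WHERE pub_date >= DATE '2006-06-01'
    AND body_tsv @@ to_tsquery('newspaper & archive');

The point of the split is that GIN is slow to update but fast to search,
while GiST is the reverse, so the write-heavy current child and the
read-only archive children each get the index type they are best served by.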

> Benjamin
>
> On Mar 21, 2007, at 8:42 AM, Oleg Bartunov wrote:
>
>> Benjamin,
>>
>> as one of the authors of tsearch2, I'd like to know more about your setup.
>> tsearch2 in 8.2 has GIN index support, which scales much better than the
>> old GiST index.
>>
>> Oleg
>>
>> On Wed, 21 Mar 2007, Benjamin Arai wrote:
>>
>>> Hi,
>>>
>>> I have been struggling to get fulltext searching working for very large
>>> databases. I can fulltext index 10s of gigs without any problem, but when
>>> I start getting into hundreds of gigs it becomes slow. My current system
>>> is a quad core with 8GB of memory. I have the resources to throw more
>>> hardware at it, but realistically it is not cost effective to buy a
>>> system with 128GB of memory. Are there any solutions that people have
>>> come up with for indexing very large text databases?
>>>
>>> Essentially I have several terabytes of text that I need to index. Each
>>> record is about 5 paragraphs of text. I am currently using TSearch2
>>> (stemming, etc.) and getting sub-optimal results: queries take more than
>>> a second to execute. Has anybody implemented such a database using
>>> multiple systems or some special add-on to TSearch2 to make things
>>> faster? I want to do something like partitioning the data across multiple
>>> systems and merging the ranked results at some master node. Is something
>>> like this possible within PostgreSQL, or must it be handled in external
>>> software?
>>>
>>> Benjamin
>>>
>>
>> Regards,
>> Oleg
>> _____________________________________________________________
>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>> Sternberg Astronomical Institute, Moscow University, Russia
>> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
>> phone: +007(495)939-16-83, +007(495)939-23-83
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
