Re: Wich hardware suits best for large full-text indexed

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Ericson Smith <eric(at)did-it(dot)com>
Cc: Bill Moran <wmoran(at)potentialtech(dot)com>, Diogo Biazus <diogo(at)ikono(dot)com(dot)br>, pgsql-general(at)postgresql(dot)org
Subject: Re: Wich hardware suits best for large full-text indexed
Date: 2004-03-31 11:54:55
Message-ID: Pine.GSO.4.58.0403311539040.20112@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, 31 Mar 2004, Ericson Smith wrote:

> Look into running Swish-e instead:
> http://www.swish-e.org
>
> Great speed, nice engine, excellent boolean searches. We run it on
> several sites each with over 500,000 documents. Performance is
> consistently sub-second response time, and we also integrate it within
> PHP, Perl and Postgresql too.

it's very different story ! There are hundreds *standalone* search engine
based on inverted indices, but you don't have *native* access to metadata
stored in database, so your search collection isn't consistent.
tsearch2 was developed specially for online update and consistency
(think about access control to documents). If you're not care about that
you don't need tsearch2. btw, tsearch2 scaled much better with long
queries.

>
> I know, it is nice to use tsearch2, but we also found the performance
> lacking for those big indices. Maybe Oleg and the tsearch2 gang have
> some extra tips?
>

Not very much, most of them are written in documentation. Other tips are
general to databases, like use multi-key indices, use dictionaries, stop words,
check words statistics.

One interesting option we have - is standalone archive search based
on OpenFTS. The main idea is to have separate searches -
online search (fresh documents) and archive (static documents) search.
Online search is performed using as usual (tsearch2), while archive search
uses inverted indices ) like swish-e, google,...... The nice thing is that
results from both searches could be easily merged because they use the
same ranking function ! So, you may have online index for month's news and
archive part for older news and I bet you could manage millions documents.

> - Ericson
>
> Bill Moran wrote:
>
> > Diogo Biazus wrote:
> >
> >> Hi folks,
> >>
> >> I have a database using tsearch2 to index 300 000 documents.
> >> I've already have optimized the queries, and the database is vacuumed
> >> on a daily basis.
> >> The stat function tells me that my index has aprox. 460 000 unique
> >> words (I'm using stemmer and a nice stopword list).
> >> The problem is performance, some queries take more than 10 seconds to
> >> execute, and I'm not sure if my bottleneck is memory or io.
> >> The server is a Athlon XP 2000, HD ATA133, 1.5 GB RAM running
> >> postgresql 7.4.3 over freebsd 5.0 with lots of shared buffers and
> >> sort_mem...
> >>
> >> Does anyone has an idea of a more cost eficient solution?
> >> How to get a better performance without having to invest some
> >> astronomicaly high amount of money?
> >
> >
> > This isn't hardware related, but FreeBSD 5 is not a particularly
> > impressive
> > performer. Especially 5.0 ... 5.2.1 would be better, but if you're
> > shooting
> > for performance, 4.9 will probably outperform both of them at this
> > stage of
> > the game.
> >
> > Something to consider if the query tuning that others are helping with
> > doesn't
> > solve the problem. Follow through with that _first_ though.
> >
> > However, if you insist on running 5, make sure your kernel is compiled
> > without
> > WITNESS ... it speeds things up noticably.
> >
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Peter Eisentraut 2004-03-31 12:32:52 Re: Some Documentation Changes
Previous Message Oleg Bartunov 2004-03-31 11:28:44 Re: Wich hardware suits best for large full-text indexed