Re: website doc search is extremely SLOW

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Dave Cramer <pg(at)fastcrypt(dot)com>, "D(dot) Dante Lorenso" <dante(at)lorenso(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: website doc search is extremely SLOW
Date: 2004-01-03 14:49:32
Message-ID: Pine.GSO.4.58.0401031707160.11643@ra.sai.msu.su
Lists: pgsql-general

Hi there,

I had hoped to release a pilot version of www.pgsql.ru with full text search
of PostgreSQL-related resources (so far we've crawled 27 sites, about
340K pages), but we started celebrating the New Year too early :)
Expect it tomorrow or Monday.

We have developed many search engines. Some of them, like tsearch2 and
OpenFTS, are based on PostgreSQL and are best embedded into a CMS for
true online updating. Their power comes from access to document attributes
stored in the database, so one can perform categorized search or restricted
search (different rights, different document status, etc). The closest
example would be search over an archive of mailing lists, which should
embed this kind of full text search engine. fts.postgresql.org, in its best
days, was one implementation of such a system. This is what I hope to have on
www.pgsql.ru, if Marc gives us access to the mailing list archives :)
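
To illustrate the restricted search idea, here is a minimal Python sketch.
It is not tsearch2: it assumes documents are in-memory records, and the
attribute names `list_name` and `status` are hypothetical. In a real setup
the attributes would be table columns and the text match would be done by
the database's full text index.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: int
    list_name: str  # hypothetical attribute: which mailing list
    status: str     # hypothetical attribute: "public" or "private"
    body: str


def tokenize(text):
    # Naive tokenizer: split on whitespace, strip punctuation, lowercase.
    return {w.strip(".,:;!?()\"'").lower() for w in text.split()} - {""}


def restricted_search(docs, query, **required_attrs):
    """Return ids of documents that contain every query word
    and whose attributes equal all of the required values."""
    terms = tokenize(query)
    hits = []
    for doc in docs:
        if any(getattr(doc, k) != v for k, v in required_attrs.items()):
            continue  # attribute restriction not satisfied
        if terms <= tokenize(doc.body):
            hits.append(doc.doc_id)
    return hits


docs = [
    Document(1, "pgsql-general", "public", "Doc search is slow on the website."),
    Document(2, "pgsql-hackers", "public", "Index search performance patch."),
    Document(3, "pgsql-general", "private", "Search logs, slow queries."),
]

# Only public messages from one list are eligible for this query.
result = restricted_search(
    docs, "slow search", list_name="pgsql-general", status="public"
)
```

The same query without attribute restrictions also matches document 3,
which is the difference between plain and restricted search.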

Other search engines we use are based on the standard technology of
inverted indices; they are best suited for indexing semi-static collections
of documents. We have a full-fledged crawler, indexer and searcher. Online
update of inverted indices is a rather complex technological task and I'm
not sure there are databases which have true online update. On www.pgsql.ru
we use GTSearch, a generic text search engine we developed for
vertical searches (for example, PostgreSQL-related resources). It has the
common set of features: phrase search, proximity ranking, site search,
morphology, stemming support, cached documents, spell checking, similar
search, etc.
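
For readers unfamiliar with inverted indices, a toy positional index in
Python is sketched below. It is an illustration only and says nothing
about how GTSearch is implemented; the class and method names are made up
for this example. It shows phrase search over stored word positions, and
why removing a document means touching many posting lists.

```python
from collections import defaultdict


class InvertedIndex:
    def __init__(self):
        # term -> {doc_id -> [positions of the term in that document]}
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        for pos, term in enumerate(text.lower().split()):
            self.postings[term].setdefault(doc_id, []).append(pos)

    def remove(self, doc_id):
        # This toy version scans every posting list to drop one document.
        for term in list(self.postings):
            self.postings[term].pop(doc_id, None)
            if not self.postings[term]:
                del self.postings[term]

    def phrase_search(self, phrase):
        terms = phrase.lower().split()
        if not terms or any(t not in self.postings for t in terms):
            return []
        # Candidate documents must contain every term of the phrase.
        candidates = set(self.postings[terms[0]])
        for t in terms[1:]:
            candidates &= set(self.postings[t])
        hits = []
        for doc_id in sorted(candidates):
            starts = self.postings[terms[0]][doc_id]
            # The phrase matches if the terms occur at consecutive positions.
            if any(
                all(s + i in self.postings[t][doc_id] for i, t in enumerate(terms))
                for s in starts
            ):
                hits.append(doc_id)
        return hits


index = InvertedIndex()
index.add(1, "full text search for postgresql")
index.add(2, "text of the full search report")
index.add(3, "postgresql full text search engine")

matches = index.phrase_search("full text search")
```

Document 2 contains all three words but not as a phrase, so it is not
returned.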

I see several separate tasks:

* official documents (documentation mostly)

I'm not sure whether there is some kind of CMS on www.postgresql.org, but
if there is, the best way is to embed tsearch2 into the CMS. You'll have a
fast, incremental search engine. There are many users of tsearch2 and I think
embedding isn't a very difficult problem. I estimate there are at most
10-20K pages of documentation, which is nothing for tsearch2.

* mailing lists archive

The mailing list archive is constantly growing and also requires
incremental updates, so tsearch2 is needed here too. Nice hardware
like Marc has described would be more than enough. We have a moderate dual
PIII 1GHz server and I hope it would be enough.

* postgresql related resources

I think this task should be solved using the standard technique: crawler,
indexer, searcher. Due to the limited number of sites it's possible to
keep indices more up to date than the major search engines do, for example
by crawling once a week. This is what we currently have on pgsql.ru, because
it doesn't require any permissions or interaction with site officials.
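
As a rough sketch of the weekly recrawl idea, the Python snippet below
picks the sites that are due for a crawl. The function name, the data
layout and the `example.org` entry are assumptions made for illustration;
the actual crawler's scheduling is not described here.

```python
from datetime import datetime, timedelta

RECRAWL_INTERVAL = timedelta(days=7)  # "crawl once a week"


def sites_due(last_crawled, now, interval=RECRAWL_INTERVAL):
    """Return sites never crawled, or last crawled at least `interval`
    ago, with the stalest sites first."""
    due = [
        (when or datetime.min, site)
        for site, when in last_crawled.items()
        if when is None or now - when >= interval
    ]
    return [site for _, site in sorted(due)]


now = datetime(2004, 1, 3, 12, 0)
last_crawled = {
    "www.postgresql.org": datetime(2003, 12, 25, 9, 0),  # 9 days ago
    "www.pgsql.ru": datetime(2004, 1, 1, 9, 0),          # 2 days ago
    "example.org": None,  # hypothetical site, never crawled
}

due = sites_due(last_crawled, now)
```

With only a few dozen sites, a check like this can run daily and still
keep every site within the one-week window.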

Regards,
Oleg

On Wed, 31 Dec 2003, Marc G. Fournier wrote:

> On Tue, 30 Dec 2003, Joshua D. Drake wrote:
>
> > Hello,
> >
> > Why are we not using Tsearch2?
>
> Because nobody has built it yet? Oleg's stuff is nice, but we want
> something that we can build into the existing web sites, not a standalone
> site ...
>
> I keep searching the web hoping someone has come up with a 'tsearch2'
> based search engine that does the spidering, but, unless it's sitting right
> in front of my eyes and I'm not seeing it, I haven't found it yet :(
>
> Out of everything I've found so far, mnogosearch is one of the best ... I
> just wish I could figure out where the bottleneck for it was, since, from
> reading their docs, their method of storing the data doesn't appear to be
> particularly off. I'm tempted to try their caching storage manager, and
> getting away from SQL totally, but I *really* want to showcase PostgreSQL
> on this :(
>
> ----
> Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
> Email: scrappy(at)hub(dot)org Yahoo!: yscrappy ICQ: 7615664

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
