Re: website doc search is extremely SLOW

From: Dave Cramer <pg(at)fastcrypt(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, "D(dot) Dante Lorenso" <dante(at)lorenso(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: website doc search is extremely SLOW
Date: 2004-01-03 15:26:19
Message-ID: 1073143578.1662.71.camel@localhost.localdomain
Lists: pgsql-general

On Sat, 2004-01-03 at 09:49, Oleg Bartunov wrote:
> Hi there,
>
> I hoped to release a pilot version of www.pgsql.ru with full text search
> of PostgreSQL-related resources (so far we've crawled 27 sites, about
> 340K pages), but we started celebrating the New Year too early :)
> Expect it tomorrow or Monday.
Fantastic!
>
> We have developed many search engines. Some of them, like tsearch2 and
> OpenFTS, are based on PostgreSQL and are best embedded into a CMS for
> true online updating. Their power comes from access to document attributes
> stored in the database, so one can perform categorized search or restricted
> search (different rights, different document status, etc.). The closest
> example would be search over a mailing list archive, which should
> embed this kind of full text search engine. fts.postgresql.org in its best
> days was one implementation of such a system. This is what I hope to have on
> www.pgsql.ru, if Marc will give us access to the mailing list archives :)

I too would like access to the archives.

>
> The other search engines we use are based on the standard technology of
> inverted indices; they are best suited for indexing semi-static collections
> of documents. We have a full-fledged crawler, indexer and searcher. Online
> update of inverted indices is a rather complex technical task, and I'm
> not sure there are databases which have true online update. On www.pgsql.ru
> we use GTSearch, a generic text search engine we developed for
> vertical searches (for example, PostgreSQL-related resources). It has
> a common set of features such as phrase search, proximity ranking, site search,
> morphology, stemming support, cached documents, spell checking, similar search,
> etc.
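
A minimal sketch of the inverted-index idea described above, assuming a
toy in-memory corpus. The class and method names (InvertedIndex, add,
remove, search, phrase) are hypothetical and are not taken from GTSearch
or tsearch2. It shows term lookup, a simple phrase search over stored
positions, and why removing or updating a document is awkward: every
posting list has to be visited.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN.findall(text.lower())


class InvertedIndex:
    """Toy positional inverted index: term -> {doc_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        # Record the position of every term occurrence in the document.
        for pos, term in enumerate(tokenize(text)):
            self.postings[term].setdefault(doc_id, []).append(pos)

    def remove(self, doc_id):
        # Removal has to visit every posting list, which is one reason
        # online updates of an inverted index are costly.
        for term in list(self.postings):
            self.postings[term].pop(doc_id, None)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, query):
        """Return the ids of documents containing all query terms."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            docs = set(self.postings.get(term, {}))
            result = docs if result is None else result & docs
        return result

    def phrase(self, query):
        """Return the ids of documents containing the terms consecutively."""
        terms = tokenize(query)
        hits = set()
        for doc_id in self.search(query):
            starts = self.postings[terms[0]][doc_id]
            if any(
                all(start + i in self.postings[t][doc_id]
                    for i, t in enumerate(terms))
                for start in starts
            ):
                hits.add(doc_id)
        return hits
```

A database-backed engine in the tsearch2 style avoids the full scan in
remove() because the index is maintained per row by the database itself.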
>
> I see several separate tasks:
>
> * official documents (documentation mostly)
>
> I'm not sure whether there is some kind of CMS on www.postgresql.org, but
> if there is, the best way is to embed tsearch2 into the CMS. You'll have
> a fast, incremental search engine. There are many users of tsearch2 and I think
> embedding isn't a very difficult problem. I estimate there are at most
> 10-20K pages of documentation, which is nothing for tsearch2.

I think a content management system is long overdue. Do you have any
good recommendations?

>
> * mailing lists archive
>
> The mailing list archive is constantly growing and
> also requires incremental update, so tsearch2 is needed here too. Nice hardware
> like Marc has described would be more than enough. We have a moderate dual
> PIII 1 GHz server and I hope it would be enough.
>
> * postgresql related resources
>
> I think this task should be solved using the standard technique: crawler,
> indexer, searcher. Due to the limited number of sites it's possible to
> keep the indices more up to date than the major search engines do, for example
> by crawling once a week. This is what we currently have on pgsql.ru because
> it doesn't require any permissions or interaction with site officials.
>
>
> Regards,
> Oleg
>
>
> On Wed, 31 Dec 2003, Marc G. Fournier wrote:
>
> > On Tue, 30 Dec 2003, Joshua D. Drake wrote:
> >
> > > Hello,
> > >
> > > Why are we not using Tsearch2?
> >
> > Because nobody has built it yet? Oleg's stuff is nice, but we want
> > something that we can build into the existing web sites, not a standalone
> > site ...
> >
> > I keep searching the web hoping someone has come up with a 'tsearch2'
> > based search engine that does the spidering, but, unless it's sitting right
> > in front of my eyes and I'm not seeing it, I haven't found it yet :(
> >
> > Out of everything I've found so far, mnogosearch is one of the best ... I
> > just wish I could figure out where the bottleneck for it was, since, from
> > reading their docs, their method of storing the data doesn't appear to be
> > particularly off. I'm tempted to try their caching storage manager, and
> > getting away from SQL totally, but I *really* want to showcase PostgreSQL
> > on this :(
> >
> > ----
> > Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
> > Email: scrappy(at)hub(dot)org Yahoo!: yscrappy ICQ: 7615664
> >
> >
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83
>
--
Dave Cramer
519 939 0336
ICQ # 1467551
