Re: [GENERAL] Re: full text searching

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Ned Lilly <ned(at)greatbridge(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <scrappy(at)hub(dot)org>
Subject: Re: [GENERAL] Re: full text searching
Date: 2001-02-08 21:07:15
Message-ID: Pine.GSO.4.33.0102082306320.22966-100000@ra.sai.msu.su
Lists: pgsql-general pgsql-hackers

On Thu, 8 Feb 2001, Ned Lilly wrote:

> (bcc'ed to -hackers)
>
> Gunnar Rønning wrote:
>
> > Does anybody know how Oracle has implemented their "context" search or
> > whatever it is called nowadays ?
>
> They're calling it Intermedia now ... http://www.oracle.com/intermedia/
>
> I have yet to meet an Oracle customer who likes it.
>
> I think there's a lot of agreement that this is an area where Postgres
> could use some work. I know Oleg Bartunov has done some interesting
> work with Postgres and the search engine at the Russian portal site
> "Rambler" ... http://www.rambler.ru/ . Oleg, could you talk a bit about
> what you guys did?

Well, we have an FTS engine based entirely on postgresql. It was developed
specifically for indexing dynamic text collections like online
news. It supports morphology, uses coordinate information and
sophisticated ranking of search results. Search and ranking are built
into postgres. Currently the biggest collection we have is about 300,000
messages. We're not very happy with performance on a collection of that
size, and specifically to improve it we did some research in the GiST area.
Using GiST we implemented index support for integer arrays, which greatly
improves search performance! Right now we are trying to understand
how to improve sort performance, which is the final (we hope) obstacle
for our FTS. Let me explain a bit:
Search performance is great, but in a real-life application we have to
display search results on a Web page, page by page. Results can be sorted
by relevance or by another parameter. For online news or a mailing
list archive, results are sorted by publication date. We found that most
of the time is spent sorting the full result set, while we need just
10-15 rows to display on a Web page (using ORDER BY .. LIMIT,OFFSET).
Some queries in our case produce about 50,000 rows (a search for "Putin",
for example)! Sort time is enormous and eats all the performance gain
we achieved for search. One solution we are currently investigating is
an implementation of partial sort in postgres.
We don't need to sort the full set. Currently LIMIT provides a rather simple
optimization - only part of the results is transferred from backend to client.
We propose to stop sorting as soon as that part of the results is already
sorted. From our experience and the literature we know that 95% of all
hits go to the first 2 pages of search results. In our worst case with
50,000 rows we could get the first page to display about 5-6 times faster
if we do partial sorting. I understand it looks like a rather narrow area
for optimization, but many people would appreciate it.
I remember that when I asked Jan to implement the LIMIT feature, many friends
immediately moved from mysql to postgres. This feature isn't standard,
but it's Web friendly and most web applications use it.
We have a patch for 7.1, well, just a sketch we did for benchmarking
purposes. Tom isn't happy and we still need some help from core developers.
But it is time for the 7.1 release and we don't want to bother developers
right now. Anyway, for a medium-size collection our FTS is good enough
even using plain 7.0.3. We were planning to release FTS as open source
before the new year but got held up by an organizational problem (still are :-(
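
A minimal sketch of the partial sort idea described above, written in
Python rather than inside the postgres backend. It is not the 7.1 patch
mentioned in this message, only an illustration. The function names
(full_sort_page, partial_sort_page), the row layout and the random
publication timestamps are all assumptions made for the example. The
point it shows: to display one page, it is enough to keep the best
OFFSET + LIMIT rows while scanning the result set, instead of sorting
all 50,000 rows first.

```python
import heapq
import random


def full_sort_page(rows, page_size, offset=0):
    """Sort the whole result set by publication date, then cut one page."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return ordered[offset:offset + page_size]


def partial_sort_page(rows, page_size, offset=0):
    """Keep only the newest (offset + page_size) rows, then cut one page.

    heapq.nlargest tracks just the best k rows seen so far, so the
    remaining rows are never fully ordered.
    """
    newest = heapq.nlargest(offset + page_size, rows, key=lambda row: row[1])
    return newest[offset:]


# Hypothetical result set: (message_id, publication_timestamp) pairs.
# Timestamps are unique so both functions must agree exactly.
random.seed(0)
timestamps = random.sample(range(10**9), 50_000)
rows = list(enumerate(timestamps))

first_page = partial_sort_page(rows, page_size=15)
second_page = partial_sort_page(rows, page_size=15, offset=15)

assert first_page == full_sort_page(rows, page_size=15)
assert second_page == full_sort_page(rows, page_size=15, offset=15)
```

The saving shrinks as OFFSET grows, since more rows have to be kept;
this matters little if most hits really do go to the first two pages.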

>
> If there's interest in spinning up a separate project to sit outside the
> database, a la Intermedia or Verity, we'd be happy to sponsor such a
> thing on our GreatBridge.org project hosting site (CVS, bug tracking,
> mail lists, etc.)

We plan to develop a sample application - searching the postgres mail archives
(I have a collection from 1995) - and present it for testing. If people are
happy with the performance and quality of results we could install it
on www.postgresql.org.

>
> Regards,
> Ned
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
