Re: website doc search is extremely SLOW

From: Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>
To: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
Cc: Dave Cramer <pg(at)fastcrypt(dot)com>, "D(dot) Dante Lorenso" <dante(at)lorenso(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: website doc search is extremely SLOW
Date: 2003-12-31 10:40:20
Message-ID: 3FF2A794.8050506@vulcanus.its.tudelft.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Marc,

At our website we had a "in database" search as well... It was terribly
slow (it was a custom built vector space model implemented in mysql+php
so that explains a bit).

We replaced it by the Xapian library (www.xapian.org) with its Omega
frontend as a middle end. I.e. we call with our php-scripts the omega
search frontend and postprocess the results with the scripts (some
rights double checks and so on), from the results we build a very simpel
SELECT ... FROM documents ... WHERE docid IN implode($docids_array)
(you understand enough php to understand this, I suppose)

With our 10GB of tekst, we have a 14GB (uncompressed, 9G compressed
orso) xapian database (the largest part is for the 6.7G positional
table), I'm pretty sure that if we'd store that information in something
like tsearch it'd be more than that 14GB...

Searches take less than a second (unless you do phrase searches of
course, that takes a few seconds and sometimes a few minutes).

I did a query on 'ext3 undelete' just a few minutes ago and it did the
search in 827150 documents in only 0.027 (a second run 0.006) seconds
(ext3 was found in 753 and undelete in 360 documents). Of course that is
excluding the results parsing, the total time to create the webpage was
"much" longer (0.43 seconds orso) due to the fact that the results
needs to be transferred via xinetd and the results needs to be extracted
from mysql (which is terrible with the "search supporting queries" we
issue :/ ) Our search machine is very similar the machine you use as
database, but it doesn't do much heavy work apart from running the
xapian/omega search combination.

If you are interested in this, I can provide (much) more information
about our implementation. Since you don't need right-checks, you could
even get away with just the omega front end all by itself (it has a nice
scripting language, but can't interface with anything but xapian).

The main advantage of taking this out of your sql database is that it
runs on its own custom built storage system (and you could offload it to
another machine, like we did).
Btw, if you really need an "in database" solution, read back the
postings of Eric Ridge at 26-12-2003 20:54 on the hackers list (he's
working on integrating xapian in postgresql as a FTI)

Best regards,

Arjen van der Meijden

Marc G. Fournier wrote:
> does anyone know anything better then mnogosearch, that works with
> PostgreSQL, for doing indexing? the database server is a Dual Xeon 2.4G,
> 4G of RAM, and a load avg right now of a lowly 1.5 ... the file system is
> 3x72G drive in a RAID5 configuration, and the database server is 7.4 ...
> the mnogosearch folk use mysql for their development, so its possible
> there is something they are doing that is slowing this process down, to
> compensate for a fault in mysql, but this is ridiculous ...
>
> note that I have it setup with what the mnogosearch folk lists as being
> 'the fastest schema for large indexes' or 'crc-multi' ...
>
> right now, we're running only 373k docs:
>
> isvr5# indexer -S
>
> Database statistics
>
> Status Expired Total
> -----------------------------
> 415 0 311 Unsupported Media Type
> 302 0 1171 Moved Temporarily
> 502 0 43 Bad Gateway
> 414 0 3 Request-URI Too Long
> 301 0 307 Moved Permanently
> 404 0 1960 Not found
> 410 0 1 Gone
> 401 0 51 Unauthorized
> 304 0 16591 Not Modified
> 200 0 373015 OK
> 504 0 48 Gateway Timeout
> 400 0 3 Bad Request
> 0 2 47 Not indexed yet
> -----------------------------
> Total 2 393551
>
> and a vacuum analyze runs nightly ...
>
> anyone with suggestions/ideas? has to be something client/server, like
> mnogosearch, as we're dealing with multiple servers searching against the
> same database ... so I don't *think* that ht/Dig is a solution, but may be
> wrong there ...

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Erwan DUROSELLE 2003-12-31 10:53:36 Rép. : What does count(*) count?
Previous Message CoL 2003-12-31 10:05:42 Re: 'like' refuses to use an index???