Skip site navigation (1) Skip section navigation (2)

Re: website doc search is extremely SLOW

From: Arjen van der Meijden <acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl>
To: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
Cc: Dave Cramer <pg(at)fastcrypt(dot)com>,"D(dot) Dante Lorenso" <dante(at)lorenso(dot)com>,"pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: website doc search is extremely SLOW
Date: 2003-12-31 10:40:20
Message-ID: 3FF2A794.8050506@vulcanus.its.tudelft.nl (view raw or flat)
Thread:
Lists: pgsql-general
Marc,

At our website we had a "in database" search as well... It was terribly 
slow (it was a custom built vector space model implemented in mysql+php 
so that explains a bit).

We replaced it by the Xapian library (www.xapian.org) with its Omega 
frontend as a middle end. I.e. we call with our php-scripts the omega 
search frontend and postprocess the results with the scripts (some 
rights double checks and so on), from the results we build a very simpel
SELECT ... FROM documents ... WHERE docid IN implode($docids_array)
(you understand enough php to understand this, I suppose)

With our 10GB of tekst, we have a 14GB (uncompressed, 9G compressed 
orso) xapian database (the largest part is for the 6.7G positional 
table), I'm pretty sure that if we'd store that information in something 
like tsearch it'd be more than that 14GB...

Searches take less than a second (unless you do phrase searches of 
course, that takes a few seconds and sometimes a few minutes).

I did a query on 'ext3 undelete' just a few minutes ago and it did the 
search in 827150 documents in only 0.027 (a second run 0.006) seconds 
(ext3 was found in 753 and undelete in 360 documents). Of course that is 
excluding the results parsing, the total time to create the webpage was 
  "much" longer (0.43 seconds orso) due to the fact that the results 
needs to be transferred via xinetd and the results needs to be extracted 
from mysql (which is terrible with the "search supporting queries" we 
issue :/ ) Our search machine is very similar the machine you use as 
database, but it doesn't do much heavy work apart from running the 
xapian/omega search combination.

If you are interested in this, I can provide (much) more information 
about our implementation. Since you don't need right-checks, you could 
even get away with just the omega front end all by itself (it has a nice 
scripting language, but can't interface with anything but xapian).

The main advantage of taking this out of your sql database is that it 
runs on its own custom built storage system (and you could offload it to 
another machine, like we did).
Btw, if you really need an "in database" solution, read back the 
postings of Eric Ridge at 26-12-2003 20:54 on the hackers list (he's 
working on integrating xapian in postgresql as a FTI)

Best regards,

Arjen van der Meijden


Marc G. Fournier wrote:
> does anyone know anything better then mnogosearch, that works with
> PostgreSQL, for doing indexing?  the database server is a Dual Xeon 2.4G,
> 4G of RAM, and a load avg right now of a lowly 1.5 ... the file system is
> 3x72G drive in a RAID5 configuration, and the database server is 7.4 ...
> the mnogosearch folk use mysql for their development, so its possible
> there is something they are doing that is slowing this process down, to
> compensate for a fault in mysql, but this is ridiculous ...
> 
> note that I have it setup with what the mnogosearch folk lists as being
> 'the fastest schema for large indexes' or 'crc-multi' ...
> 
> right now, we're running only 373k docs:
> 
> isvr5# indexer -S
> 
>           Database statistics
> 
>     Status    Expired      Total
>    -----------------------------
>        415          0        311 Unsupported Media Type
>        302          0       1171 Moved Temporarily
>        502          0         43 Bad Gateway
>        414          0          3 Request-URI Too Long
>        301          0        307 Moved Permanently
>        404          0       1960 Not found
>        410          0          1 Gone
>        401          0         51 Unauthorized
>        304          0      16591 Not Modified
>        200          0     373015 OK
>        504          0         48 Gateway Timeout
>        400          0          3 Bad Request
>          0          2         47 Not indexed yet
>    -----------------------------
>      Total          2     393551
> 
> and a vacuum analyze runs nightly ...
> 
> anyone with suggestions/ideas?  has to be something client/server, like
> mnogosearch, as we're dealing with multiple servers searching against the
> same database ... so I don't *think* that ht/Dig is a solution, but may be
> wrong there ...



In response to

Responses

pgsql-general by date

Next:From: Erwan DUROSELLEDate: 2003-12-31 10:53:36
Subject: Rép. : What does count(*) count?
Previous:From: CoLDate: 2003-12-31 10:05:42
Subject: Re: 'like' refuses to use an index???

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group