Re: website doc search is extremely SLOW

From: Ericson Smith <eric(at)did-it(dot)com>
To: johnsw(at)wardbrook(dot)com
Cc: pg(at)fastcrypt(dot)com, "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>,"D(dot) Dante Lorenso" <dante(at)lorenso(dot)com>,"pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: website doc search is extremely SLOW
Date: 2003-12-31 13:38:43
Message-ID: 3FF2D163.1060507@did-it.com
Lists: pgsql-general
You should probably take a look at the Swish project. For one 
project, we tried Tsearch/Tsearch2 and even (gasp) MySQL full-text search, 
but with over 600,000 documents to index, each of them took too long to run 
searches, especially as the database was swapped in and out of memory 
depending on the segment being searched. MySQL full-text search was the least usable.

Swish uses its own internal DB format, and comes with a simple spider as 
well. You can make it search by category, date and other nifty criteria 
also.
http://swish-e.org

You can take a look over at the project and do some searches to see what 
I mean:
http://cbd-net.com

Warmest regards, 
Ericson Smith
Tracking Specialist/DBA
+-----------------------+----------------------------+
| http://www.did-it.com | "When I'm paid, I always   |
| eric(at)did-it(dot)com       | follow the job through.    |
| 516-255-0500          | You know that." -Angel Eyes|
+-----------------------+----------------------------+ 



John Sidney-Woollett wrote:

>I think that Oleg's new search offering looks really good and fast. (I
>can't wait till I have some task that needs tsearch!).
>
>I agree with Dave that searching the docs is more important for me than
>the sites - but it would be really nice to have both, in one tool.
>
>I built something similar for the Tate Gallery in the UK - here you can
>select the type of content that you want returned, either static pages or
>dynamic. You can see the idea at
>http://www.tate.org.uk/search/default.jsp?terms=sunset%20oil&action=new
>
>This is custom built (using java/Oracle), supports stemming, boolean
>operators, exact phrase matching, relevancy and matched term highlighting.
>
>You can switch on/off the types of documents that you are not interested
>in. Using this analogy, a search facility that could offer you results
>from i) the docs and/or ii) the postgres sites static pages would be very
>useful.
>
>John Sidney-Woollett
>
>Dave Cramer said:
>  
>
>>Marc,
>>
>>No it doesn't spider, it is a specialized tool for searching documents.
>>
>>I'm curious, what value is there in being able to count the number of
>>URLs?
>>
>>It does do things like query all documents where CREATE AND TABLE are n
>>words apart, just as fast. I would think these are more valuable for
>>document searching?
>>
>>I think the challenge here is deciding what we want to search. I am betting
>>that folks use this page as they would man, i.e., to ask: what is the
>>command for create trigger?
>>
>>As I said my offer stands to help out, but I think if the goal is to
>>search the entire website, then this particular tool is not useful.
>>
>>At this point I am working on indexing the sgml directly as it has less
>>cruft in it. For instance all the links that appear in every summary are
>>just noise.
>>
>>
>>Dave
>>
>>On Wed, 2003-12-31 at 00:44, Marc G. Fournier wrote:
>>    
>>
>>>On Wed, 31 Dec 2003, Dave Cramer wrote:
>>>
>>>      
>>>
>>>>I can modify mine to be client server if you want?
>>>>
>>>>It is a java app, so we need to be able to run jdk1.3 at least?
>>>>        
>>>>
>>>jdk1.4 is available on the VMs ... does yours spider?  For instance, you
>>>mention that you have the docs indexed right now, but we are currently
>>>indexing:
>>>
>>>Server http://archives.postgresql.org/
>>>Server http://advocacy.postgresql.org/
>>>Server http://developer.postgresql.org/
>>>Server http://gborg.postgresql.org/
>>>Server http://pgadmin.postgresql.org/
>>>Server http://techdocs.postgresql.org/
>>>Server http://www.postgresql.org/
>>>
>>>will it be able to handle:
>>>
>>>186_archives=# select count(*) from url;
>>> count
>>>--------
>>> 393551
>>>(1 row)
>>>
>>>as fast as you are finding with just the docs?
>>>
>>>----
>>>Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
>>>Email: scrappy(at)hub(dot)org           Yahoo!: yscrappy              ICQ: 7615664
>>>
>>>      
>>>
>>--
>>Dave Cramer
>>519 939 0336
>>ICQ # 1467551
>>
>>
>>---------------------------(end of broadcast)---------------------------
>>TIP 9: the planner will ignore your desire to choose an index scan if your
>>      joining column's datatypes do not match
>>
>>    
>>
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 2: you can get off all lists at once with the unregister command
>    (send "unregister YourEmailAddressHere" to majordomo(at)postgresql(dot)org)
>
>  
>

Attachment: eric.vcf
Description: text/x-vcard (315 bytes)
