Re: website doc search is extremely SLOW

From: "John Sidney-Woollett" <johnsw(at)wardbrook(dot)com>
To: "Ericson Smith" <eric(at)did-it(dot)com>
Cc: johnsw(at)wardbrook(dot)com, pg(at)fastcrypt(dot)com, "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>, "D(dot) Dante Lorenso" <dante(at)lorenso(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: website doc search is extremely SLOW
Date: 2003-12-31 13:44:56
Message-ID: 4026.192.168.0.64.1072878296.squirrel@mercury.wardbrook.com
Lists: pgsql-general

Wow, you're right - I could have probably saved myself a load of time! :)

Although you do learn a lot reinventing the wheel... or at least you
hit the same issues and insights others did before.

John

Ericson Smith said:
> You should probably take a look at the Swish project. For a certain
> project, we tried Tsearch2/Tsearch, even (gasp) MySQL fulltext search,
> but with over 600,000 documents to index, both took too long to conduct
> searches, especially as the database was swapped in and out of memory
> based on search segment. MySQL full text was the most unusable.
>
> Swish uses its own internal DB format, and comes with a simple spider as
> well. You can make it search by category, date and other nifty criteria
> also.
> http://swish-e.org
>
> You can take a look over at the project and do some searches to see what
> I mean:
> http://cbd-net.com
>
> Warmest regards,
> Ericson Smith
> Tracking Specialist/DBA
> +-----------------------+----------------------------+
> | http://www.did-it.com | "When I'm paid, I always |
> | eric(at)did-it(dot)com | follow the job through. |
> | 516-255-0500 | You know that." -Angel Eyes|
> +-----------------------+----------------------------+
>
>
>
> John Sidney-Woollett wrote:
>
>>I think that Oleg's new search offering looks really good and fast. (I
>>can't wait till I have some task that needs tsearch!).
>>
>>I agree with Dave that searching the docs is more important for me than
>>the sites - but it would be really nice to have both, in one tool.
>>
>>I built something similar for the Tate Gallery in the UK - here you can
>>select the type of content that you want returned, either static pages or
>>dynamic. You can see the idea at
>>http://www.tate.org.uk/search/default.jsp?terms=sunset%20oil&action=new
>>
>>This is custom built (using java/Oracle), supports stemming, boolean
>>operators, exact phrase matching, relevancy and matched term highlighting.
>>
>>You can switch on/off the types of documents that you are not interested
>>in. Using this analogy, a search facility that could offer you results
>>from i) the docs and/or ii) the postgres sites static pages would be very
>>useful.
>>
>>John Sidney-Woollett
>>
>>Dave Cramer said:
>>
>>
>>>Marc,
>>>
>>>No it doesn't spider, it is a specialized tool for searching documents.
>>>
>>>I'm curious, what value is there to being able to count the number of
>>>URLs?
>>>
>>>It does do things like query all documents where CREATE AND TABLE are n
>>>words apart, just as fast, I would think these are more valuable to
>>>document searching?
>>>
>>>I think the challenge here is what we want to search. I am betting
>>>that folks use this page as they would man, i.e. what is the command for
>>>create trigger?
>>>
>>>As I said my offer stands to help out, but I think if the goal is to
>>>search the entire website, then this particular tool is not useful.
>>>
>>>At this point I am working on indexing the sgml directly as it has less
>>>cruft in it. For instance all the links that appear in every summary are
>>>just noise.
>>>
>>>
>>>Dave
>>>
>>>On Wed, 2003-12-31 at 00:44, Marc G. Fournier wrote:
>>>
>>>
>>>>On Wed, 31 Dec 2003, Dave Cramer wrote:
>>>>
>>>>
>>>>
>>>>>I can modify mine to be client server if you want?
>>>>>
>>>>>It is a java app, so we need to be able to run jdk1.3 at least?
>>>>>
>>>>>
>>>>jdk1.4 is available on the VMs ... does yours spider? For instance, you
>>>>mention that you have the docs indexed right now, but we are currently
>>>>indexing:
>>>>
>>>>Server http://archives.postgresql.org/
>>>>Server http://advocacy.postgresql.org/
>>>>Server http://developer.postgresql.org/
>>>>Server http://gborg.postgresql.org/
>>>>Server http://pgadmin.postgresql.org/
>>>>Server http://techdocs.postgresql.org/
>>>>Server http://www.postgresql.org/
>>>>
>>>>will it be able to handle:
>>>>
>>>>186_archives=# select count(*) from url;
>>>> count
>>>>--------
>>>> 393551
>>>>(1 row)
>>>>
>>>>as fast as you are finding with just the docs?
>>>>
>>>>----
>>>>Marc G. Fournier Hub.Org Networking Services
>>>>(http://www.hub.org)
>>>>Email: scrappy(at)hub(dot)org Yahoo!: yscrappy ICQ: 7615664
>>>>
>>>>
>>>>
>>>--
>>>Dave Cramer
>>>519 939 0336
>>>ICQ # 1467551
>>>
>>>
>>>---------------------------(end of broadcast)---------------------------
>>>TIP 9: the planner will ignore your desire to choose an index scan if your
>>> joining column's datatypes do not match
>>>
>>>
>>>
>>
>>
>>---------------------------(end of broadcast)---------------------------
>>TIP 2: you can get off all lists at once with the unregister command
>> (send "unregister YourEmailAddressHere" to majordomo(at)postgresql(dot)org)
>>
>>
>>
>
