Re: website doc search is extremely SLOW

From: Dave Cramer <pg(at)fastcrypt(dot)com>
To: johnsw(at)wardbrook(dot)com
Cc: Ericson Smith <eric(at)did-it(dot)com>, "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>, "D(dot) Dante Lorenso" <dante(at)lorenso(dot)com>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: website doc search is extremely SLOW
Date: 2003-12-31 14:02:57
Message-ID: 1072879377.2167.7.camel@localhost.localdomain
Lists: pgsql-general

Well, it appears there are quite a few solutions to use, so the next
question should be: what are we trying to accomplish here?

One thing I think is that the documentation search should be
limited to the documentation.

Who is in a position to make the decision of which solution to use?

Dave
On Wed, 2003-12-31 at 08:44, John Sidney-Woollett wrote:
> Wow, you're right - I could have probably saved myself a load of time! :)
>
> Although you do learn a lot reinventing the wheel... ...or at least you
> hit the same issues and insights others did before...
>
> John
>
> Ericson Smith said:
> > You should probably take a look at the Swish project. For a certain
> > project, we tried Tsearch2/Tsearch, even (gasp) MySQL fulltext search,
> > but with over 600,000 documents to index, both took too long to conduct
> > searches, especially as the database was swapped in and out of memory
> > based on search segment. MySQL full text was the most unusable.
> >
> > Swish uses its own internal DB format, and comes with a simple spider as
> > well. You can make it search by category, date and other nifty criteria
> > also.
> > http://swish-e.org
> >
> > You can take a look over at the project and do some searches to see what
> > I mean:
> > http://cbd-net.com
> >
> > Warmest regards,
> > Ericson Smith
> > Tracking Specialist/DBA
> > +-----------------------+----------------------------+
> > | http://www.did-it.com | "When I'm paid, I always |
> > | eric(at)did-it(dot)com | follow the job through. |
> > | 516-255-0500 | You know that." -Angel Eyes|
> > +-----------------------+----------------------------+
> >
> >
> >
> > John Sidney-Woollett wrote:
> >
> >>I think that Oleg's new search offering looks really good and fast. (I
> >>can't wait till I have some task that needs tsearch!).
> >>
> >>I agree with Dave that searching the docs is more important for me than
> >>the sites - but it would be really nice to have both, in one tool.
> >>
> >>I built something similar for the Tate Gallery in the UK - here you can
> >>select the type of content that you want returned, either static pages or
> >>dynamic. You can see the idea at
> >>http://www.tate.org.uk/search/default.jsp?terms=sunset%20oil&action=new
> >>
> >>This is custom built (using java/Oracle), supports stemming, boolean
> >>operators, exact phrase matching, relevancy and matched term
> >>highlighting.
> >>
> >>You can switch on/off the types of documents that you are not interested
> >>in. Using this analogy, a search facility that could offer you results
> >>from i) the docs and/or ii) the postgres sites' static pages would be very
> >>useful.
> >>
> >>John Sidney-Woollett
> >>
> >>Dave Cramer said:
> >>
> >>
> >>>Marc,
> >>>
> >>>No it doesn't spider, it is a specialized tool for searching documents.
> >>>
> >>>I'm curious, what value is there to being able to count the number of
> >>>URLs?
> >>>
> >>>It does do things like query all documents where CREATE and TABLE are n
> >>>words apart, just as fast. I would think these are more valuable for
> >>>document searching?
> >>>
> >>>I think the challenge here is what do we want to search. I am betting
> >>>that folks use this page as they would man, i.e. what is the command for
> >>>create trigger?
> >>>
> >>>As I said my offer stands to help out, but I think if the goal is to
> >>>search the entire website, then this particular tool is not useful.
> >>>
> >>>At this point I am working on indexing the sgml directly as it has less
> >>>cruft in it. For instance all the links that appear in every summary are
> >>>just noise.
> >>>
> >>>
> >>>Dave
> >>>
> >>>On Wed, 2003-12-31 at 00:44, Marc G. Fournier wrote:
> >>>
> >>>
> >>>>On Wed, 31 Dec 2003, Dave Cramer wrote:
> >>>>
> >>>>
> >>>>
> >>>>>I can modify mine to be client server if you want?
> >>>>>
> >>>>>It is a java app, so we need to be able to run jdk1.3 at least?
> >>>>>
> >>>>>
> >>>>jdk1.4 is available on the VMs ... does yours spider? for instance, you
> >>>>mention that you have the docs indexed right now, but we are currently
> >>>>indexing:
> >>>>
> >>>>Server http://archives.postgresql.org/
> >>>>Server http://advocacy.postgresql.org/
> >>>>Server http://developer.postgresql.org/
> >>>>Server http://gborg.postgresql.org/
> >>>>Server http://pgadmin.postgresql.org/
> >>>>Server http://techdocs.postgresql.org/
> >>>>Server http://www.postgresql.org/
> >>>>
> >>>>will it be able to handle:
> >>>>
> >>>>186_archives=# select count(*) from url;
> >>>> count
> >>>>--------
> >>>> 393551
> >>>>(1 row)
> >>>>
> >>>>as fast as you are finding with just the docs?
> >>>>
> >>>>----
> >>>>Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
> >>>>Email: scrappy(at)hub(dot)org Yahoo!: yscrappy ICQ: 7615664
> >>>>
> >>>>
> >>>>
> >>>--
> >>>Dave Cramer
> >>>519 939 0336
> >>>ICQ # 1467551
> >>>
> >>>
> >>>---------------------------(end of broadcast)---------------------------
> >>>TIP 9: the planner will ignore your desire to choose an index scan if your
> >>> joining column's datatypes do not match
> >>>
> >>>
> >>>
> >>
> >>
> >>---------------------------(end of broadcast)---------------------------
> >>TIP 2: you can get off all lists at once with the unregister command
> >> (send "unregister YourEmailAddressHere" to majordomo(at)postgresql(dot)org)
> >>
> >>
> >>
> >
>
--
Dave Cramer
519 939 0336
ICQ # 1467551
