Re: Mnogosearch (Was: Re: website doc search is ... )

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Mnogosearch (Was: Re: website doc search is ... )
Date: 2004-01-03 14:54:08
Message-ID: Pine.GSO.4.58.0401031751460.11643@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Thu, 1 Jan 2004, Marc G. Fournier wrote:

> On Thu, 1 Jan 2004, Bruce Momjian wrote:
>
> > Marc G. Fournier wrote:
> > > 186_archives=# \d ndict7
> > > Table "public.ndict7"
> > > Column | Type | Modifiers
> > > ---------+---------+--------------------
> > > url_id | integer | not null default 0
> > > word_id | integer | not null default 0
> > > intag | integer | not null default 0
> > > Indexes:
> > > "n7_url" btree (url_id)
> > > "n7_word" btree (word_id)
> > >
> > >
> > > The slowdown is the LIKE condition, as the ndict[78] word_id conditions
> > > return near instantly when run individually, and when I run the 'url/LIKE'
> > > condition, it takes "forever" ...
> >
> > Does it help to CLUSTER url.url? Is your data being loaded in so
> > identical values used by LIKE are next to each other?
>
> Just tried CLUSTER, and no difference, but ... chat'd with Dave on ICQ
> this evening, and was thinking of something ... and it comes back to
> something that I mentioned awhile back ...
>
> Taking the ndict8 query that I originally presented, now post CLUSTER, and
> an explain analyze looks like:
>
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------
> Hash Join (cost=13918.23..26550.58 rows=17 width=8) (actual time=4053.403..83481.769 rows=13415 loops=1)
> Hash Cond: ("outer".url_id = "inner".rec_id)
> -> Index Scan using n8_word on ndict8 (cost=0.00..12616.09 rows=3219 width=8) (actual time=113.645..79163.431 rows=15533 loops=1)
> Index Cond: (word_id = 417851441)
> -> Hash (cost=13913.31..13913.31 rows=1968 width=4) (actual time=3920.597..3920.597 rows=0 loops=1)
> -> Seq Scan on url (cost=0.00..13913.31 rows=1968 width=4) (actual time=3.837..2377.853 rows=304811 loops=1)
> Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
> Total runtime: 83578.572 ms
> (8 rows)
>
> Now, if I knock off the LIKE, so that I'm returning all rows from ndict8,
> join'd to all the URLs that contain them, you get:
>
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------
> Nested Loop (cost=0.00..30183.13 rows=3219 width=8) (actual time=0.299..1217.116 rows=15533 loops=1)
> -> Index Scan using n8_word on ndict8 (cost=0.00..12616.09 rows=3219 width=8) (actual time=0.144..458.891 rows=15533 loops=1)
> Index Cond: (word_id = 417851441)
> -> Index Scan using url_rec_id on url (cost=0.00..5.44 rows=1 width=4) (actual time=0.024..0.029 rows=1 loops=15533)
> Index Cond: (url.rec_id = "outer".url_id)
> Total runtime: 1286.647 ms
> (6 rows)
>
> So, there are 15333 URLs that contain that word ... now, what I want to
> find out is how many of those 15333 URLs contain
> 'http://archives.postgresql.org/%%', which is 13415 ...

what's the need for such query ? Are you trying to restrict search to
archives ? Why not just have site attribute for document and use simple
join ?

>
> The problem is that right now, we look at the LIKE first, giving us ~300k
> rows, and then search through those for those who have the word matching
> ... is there some way of reducing the priority of the LIKE part of the
> query, as far as the planner is concerned, so that it will "resolve" the =
> first, and then work the LIKE on the resultant set, instead of the other
> way around? So that the query is only checking 15k records for the 13k
> that match, instead of searching through 300k?
>
> I'm guessing that the reason that the LIKE is taking precidence(sp?) is
> because the URL table has less rows in it then ndict8?
>
> ----
> Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
> Email: scrappy(at)hub(dot)org Yahoo!: yscrappy ICQ: 7615664
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Dave Cramer 2004-01-03 15:26:19 Re: website doc search is extremely SLOW
Previous Message Oleg Bartunov 2004-01-03 14:49:32 Re: website doc search is extremely SLOW