Re: Mnogosearch (Was: Re: website doc search is ... )

From: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Mnogosearch (Was: Re: website doc search is ... )
Date: 2004-01-01 20:07:32
Message-ID: 20040101160222.I913@ganymede.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Thu, 1 Jan 2004, Tom Lane wrote:

> "Marc G. Fournier" <scrappy(at)postgresql(dot)org> writes:
> > The full first query:
>
> > SELECT ndict8.url_id,ndict8.intag
> > FROM ndict8, url
> > WHERE ndict8.word_id=417851441
> > AND url.rec_id=ndict8.url_id
> > AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%');
>
> > returns 13415 rows, and explain analyze shows:
>
> > Nested Loop (cost=0.00..30199.82 rows=17 width=8) (actual time=0.312..1459.504 rows=13415 loops=1)
> > -> Index Scan using n8_word on ndict8 (cost=0.00..12616.09 rows=3219 width=8) (actual time=0.186..387.673 rows=15532 loops=1)
> > Index Cond: (word_id = 417851441)
> > -> Index Scan using url_rec_id on url (cost=0.00..5.45 rows=1 width=4) (actual time=0.029..0.050 rows=1 loops=15532)
> > Index Cond: (url.rec_id = "outer".url_id)
> > Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
> > Total runtime: 1520.145 ms
> > (7 rows)
>
> The more I look at it, the more it seems that this is the best plan for
> the query. Since the URL condition is very unselective (and will
> probably be so in most all variants of this query), it just doesn't pay
> to try to apply it before doing the join. What we want is to make the
> join happen quickly, and not even bother applying the URL test until
> after we have a joinable url entry.
>
> (In the back of my mind here is the knowledge that mnogosearch is
> optimized for mysql, which is too stupid to do the query in any way
> other than a plan like the above.)
>
> I think Bruce's original suggestion of clustering was right on, except
> he guessed wrong about what to cluster. The slow part is the scan on
> ndict8, as we saw in the later message:
>
> -> Index Scan using n8_word on ndict8 (cost=0.00..12616.09 rows=3219 width=8) (actual time=113.645..79163.431 rows=15533 loops=1)
> Index Cond: (word_id = 417851441)
>
> Presumably, the first EXPLAIN shows the behavior when this portion of
> ndict8 and its index have been cached, while the second EXPLAIN shows
> what happens when they're not in cache. So my suggestion is to CLUSTER
> ndict8 on n8_word. It might also help to CLUSTER url on url_rec_id.
> Make sure the plan goes back to the nested indexscan as above (you might
> need to undo the statistics-target changes).

k, so return statistics to the default, and run a CLUSTER on n8_word and
url_rec_id ... now, question I asked previously, but I think Bruce might
have overlooked it ...

what sort of impact does CLUSTER have on the system? For instance, an
index happens nightly, so I'm guessing that I'll have to CLUSTER each
right after? Will successive CLUSTERs take less time then the initial
one? I'm guessing so, since the initial one will have 100% to sort, while
subsequent ones will have a smaller set to work with, but figured I'd ask
... from the man page, all I figure I need to do (other then the initial
time) is:

VACUUM;
CLUSTER;

With 7.4, VACUUM full isn't a requirement, but is it if I'm going to do a
CLUSTER after?

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy(at)hub(dot)org Yahoo!: yscrappy ICQ: 7615664

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2004-01-01 20:12:36 Re: Mnogosearch (Was: Re: website doc search is ... )
Previous Message Tom Lane 2004-01-01 19:55:18 Re: Mnogosearch (Was: Re: website doc search is ... )