Re: Mnogosearch (Was: Re: website doc search is ... )

From: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Marc G(dot) Fournier" <scrappy(at)postgresql(dot)org>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Mnogosearch (Was: Re: website doc search is ... )
Date: 2004-01-01 19:54:12
Message-ID: 20040101152310.E913@ganymede.hub.org
Lists: pgsql-general

On Thu, 1 Jan 2004, Tom Lane wrote:

> "Marc G. Fournier" <scrappy(at)postgresql(dot)org> writes:
> > I don't see a difference between the two, other then time changes, but
> > that could just be that runA had a server a bit more idle then runB ...
> > something I'm not seeing here?
>
> Well, the difference I was hoping for was a more accurate rows estimate
> for the indexscan, which indeed we got (estimate went from 3210 to
> 16075, vs reality of 15533). But it didn't change the plan :-(.
>
> Looking more closely, I see the rows estimate for the seqscan on "url"
> is pretty awful too (1968 vs reality of 304811). I think it would get
> better if you were doing just
> AND (url.url LIKE 'http://archives.postgresql.org/%%');
> without the concatenation of an empty string. Is there a reason for the
> concatenation part of the expression?

Believe it or not, the concatenation was based on a discussion *way* back
(2 years, maybe?) when we first started using Mnogosearch, in which you
suggested going that route ... in fact, at the time (bear in mind, this is
back in 7.2 days), it actually sped things up ...
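
For reference, if I'm reading the plans right, the two forms of the predicate being compared are roughly these (the concatenated one being what the query has used until now, the plain one being what Tom is suggesting):

```sql
-- with the '' concatenation: the expression is opaque to the planner,
-- so it can't apply its LIKE-prefix selectivity estimate
... AND (url.url || '' LIKE 'http://archives.postgresql.org/%%')

-- plain LIKE on the column: the planner can expand this into the range
-- condition that shows up in the plan below
-- (url >= 'http://archives.postgresql.org/' AND url < 'http://archives.postgresql.org0')
... AND (url.url LIKE 'http://archives.postgresql.org/%%')
```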

Ok, with statistics set to 10, we now have:

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..31672.49 rows=1927 width=8) (actual time=117.064..54476.806 rows=13415 loops=1)
   ->  Index Scan using n8_word on ndict8  (cost=0.00..12567.73 rows=3210 width=8) (actual time=80.230..47844.752 rows=15533 loops=1)
         Index Cond: (word_id = 417851441)
   ->  Index Scan using url_rec_id on url  (cost=0.00..5.94 rows=1 width=4) (actual time=0.392..0.398 rows=1 loops=15533)
         Index Cond: (url.rec_id = "outer".url_id)
         Filter: (url ~~ 'http://archives.postgresql.org/%%'::text)
 Total runtime: 54555.011 ms
(7 rows)
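
(For the record, the change between the two runs was along these lines — column name taken from the plan, exact commands from memory:)

```sql
-- bump the per-column statistics target on the join column, then
-- re-gather stats so the planner sees the new histogram
ALTER TABLE ndict8 ALTER COLUMN word_id SET STATISTICS 1000;
VACUUM ANALYZE ndict8;
```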

And, at 1000 (and appropriate vacuum analyze on ndict8):

                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=91613.33..92959.41 rows=9026 width=8) (actual time=12834.316..16726.018 rows=13415 loops=1)
   Merge Cond: ("outer".url_id = "inner".rec_id)
   ->  Sort  (cost=59770.57..59808.18 rows=15043 width=8) (actual time=776.823..849.798 rows=15533 loops=1)
         Sort Key: ndict8.url_id
         ->  Index Scan using n8_word on ndict8  (cost=0.00..58726.82 rows=15043 width=8) (actual time=0.296..680.139 rows=15533 loops=1)
               Index Cond: (word_id = 417851441)
   ->  Sort  (cost=31842.76..32433.09 rows=236133 width=4) (actual time=12056.594..14159.852 rows=311731 loops=1)
         Sort Key: url.rec_id
         ->  Index Scan using url_url on url  (cost=0.00..10768.79 rows=236133 width=4) (actual time=225.243..8353.024 rows=304811 loops=1)
               Index Cond: ((url >= 'http://archives.postgresql.org/'::text) AND (url < 'http://archives.postgresql.org0'::text))
               Filter: (url ~~ 'http://archives.postgresql.org/%%'::text)
 Total runtime: 16796.932 ms
(12 rows)

Closer to what you were looking/hoping for?

Second run, @1000, shows:

Total runtime: 12194.016 ms
(12 rows)

Second run, after knocking her back down to 10, shows:

Total runtime: 58119.150 ms
(7 rows)

So we're definitely improved ... if these are the kind of results you were
hoping to see, then I guess the next step would be to increase the target
and re-analyze all the word_id columns ... what about the url.url column?
Should that be done as well? And what does that setting actually affect —
*just* the time it takes to analyze the table? From the verbose output, it
looks like ANALYZE scans more rows than it did at 10 ... is this something
that can be set database-wide, before loading the data? And/or is the
current default simply too low?
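
(Partially answering my own question here — as I understand it, the database-wide knob is default_statistics_target, settable in postgresql.conf or per session, with the per-column setting taking precedence; the numbers below are just illustrative:)

```sql
-- database-wide default, applied to any column without an explicit
-- per-column target (can also go in postgresql.conf)
SET default_statistics_target = 100;

-- a per-column override still wins over the default
ALTER TABLE url ALTER COLUMN url SET STATISTICS 1000;
ANALYZE url;
```

If I have it right, a larger target mostly means ANALYZE samples more rows and keeps a bigger histogram in pg_statistic — so ANALYZE and planning get a bit slower, estimates get better, and query execution itself is unaffected.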

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy(at)hub(dot)org Yahoo!: yscrappy ICQ: 7615664
