Re: tsearch2 poor performance

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Kris Kiger <kris(at)musicrebellion(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: tsearch2 poor performance
Date: 2004-09-29 12:10:56
Message-ID: Pine.GSO.4.58.0409291600370.14980@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin pgsql-hackers

Kris,

we're working on prototype of tsearchd - full text search daemon, which
maintain static inverted index outside of postgresql using the same
parser, dictionary tsearch2 does. This approach could scale up
fts capability preserving access to metadata, so yo may have
"archive" part of your collection (tsearchd) and "online", which could be
searchable with tsearch2.

Here is what we have right now:

pages ( tid integer, fts_index tsvector)

1. Create index
select count(tdindex(tid,fts_index)) from pages;
2. Flush index
select tdflush();
3. Search
select pages.tid, rank(fts_index,to_tsquery('supernovae & magellan')) as rank
from pages, tdsearch(to_tsquery('supernovae & magellan')) as idx where
tid=idx order by rank desc;

If it's possible, you could share your data, so we could test our
prototype on real data.

Oleg

On Mon, 27 Sep 2004, Kris Kiger wrote:

> Yes, it is much better than no index of sequential scan. We may just be
> looking at the best performance tsearch2 can offer on my machine.
>
> search_test=# explain analyze SELECT count(q) FROM product,
> to_tsquery('oil') AS q WHERE vector @@ q;
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=67847264.50..67847264.50 rows=1 width=32) (actual
> time=83311.552..83311.555 rows=1 loops=1)
> -> Nested Loop (cost=12.50..67839764.50 rows=3000001 width=32)
> (actual time=0.204..81960.198 rows=226357 loops=1)
> Join Filter: ("outer".vector @@ "inner".q)
> -> Seq Scan on product (cost=0.00..339752.00 rows=3000000
> width=32) (actual time=0.100..27415.795 rows=3000000 loops=1)
> -> Materialize (cost=12.50..22.50 rows=1000 width=32) (actual
> time=0.003..0.006 rows=1 loops=3000000)
> -> Function Scan on q (cost=0.00..12.50 rows=1000
> width=32) (actual time=0.020..0.024 rows=1 loops=1)
> Total runtime: 83311.735 ms
> (7 rows)
>
> search_test=# explain analyze select count(*) from product where
> description like '% oil %';
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=347264.01..347264.01 rows=1 width=0) (actual
> time=39858.350..39858.353 rows=1 loops=1)
> -> Seq Scan on product (cost=0.00..347252.00 rows=4801 width=0)
> (actual time=0.100..38320.293 rows=226357 loops=1)
> Filter: (description ~~ '% oil %'::text)
> Total runtime: 39858.491 ms
>
>
> >>Oleg,
> >>
> >> Thanks for the help on this.
> >>
> >> The query I used to return the 508 number is:
> >> SELECT * FROM stat('SELECT vector FROM product') ORDER BY ndoc
> >>desc, word ;
> >>
> >> Testing says, the more words I use, the faster the query is. My
> >>original search word, 'oil', appears in 226,357 documents 233,266 times.
> >> As far as distinct words go, 'oil' is middle of the road for
> >>occurences. As it is set up now, the best search time I am getting on
> >>this single word is roughly 22 seconds.
> >>
> >>
> >
> >Does this time (22 seconds) is still better than seq. scan (no index)
> >or standard 'LIKE' ?
> >
> >
> >
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Vangelis Natsios 2004-09-29 14:27:01 Data Recovery
Previous Message Magnus Hagander 2004-09-29 08:24:20 Re: How to install plperl on Windows?

Browse pgsql-hackers by date

  From Date Subject
Next Message Merlin Moncure 2004-09-29 12:56:12 Re: shared memory release following failed lock acquirement.
Previous Message Bruce Momjian 2004-09-29 11:37:00 Re: tweaking MemSet() performance - 7.4.5